Sankey diagrams for USGS gauge data in python(?)

This post was inspired by the Sankey diagram in Figure 1 of this pre-print led by Dave Gold: “Exploring the Spatially Compounding Multi-sectoral Drought Vulnerabilities in Colorado’s West Slope River Basins” (Gold, Reed & Gupta, In Review), which features a Sankey diagram of flow contributions to Lake Powell. I like the figure, and thought I’d make an effort to produce similar diagrams using USGS gauge data.

Sankey diagrams show data flows between different source and target destinations. Lots of people use them to visualize their personal or business cash flows. They are an obvious visualization option for streamflows.

To explain the “(?)” in my title: when I started this, I quickly realized that I had to choose between two popular plotting packages: matplotlib or plotly.

I am a frequent matplotlib user and definitely appreciate the level of control it offers during figure generation. However, designing highly customized figures in matplotlib can sometimes be time- and line-intensive. On the other hand, in my experience, plotly can often produce appealing graphics with less code. I am also drawn to the fact that plotly graphics are interactive objects rather than static figures.

I decided to go with plotly to try something new. If you want to hear my complaints and thoughts on use context, you can skip to the conclusions below.

In the sections below, I provide some code which will:

  • Define a network of USGS gauge stations to include in the plot
  • Retrieve data from USGS gauge stations
  • Create a Sankey diagram using plotly showing streamflows across the network

Here, I focus on the Rio Grande river upstream of Albuquerque, NM. However, you can plot a different streamflow network by modifying the dictionary of upstream nodes that defines the network.


Plotting a Sankey streamflow network with plotly

The code used here requires both plotly and the pygeohydro package (for USGS data retrieval).

from pygeohydro import NWIS
import plotly.graph_objects as go

With that out of the way, we can get started.

Defining the flow network & data retrieval

I start by defining a dictionary called upstream_stations which defines the relationships between different gauges of interest.

This dictionary contains pairs of the form: {"GAUGE_ID" : ["LIST_OF", "UPSTREAM", "GAUGE_IDs"]}

If there is no upstream site, then include an empty list. For the Rio Grande network, this looks like:

# define relationships between each gauge and upstream sites
upstream_stations = {
    '08329918' : ['08319000', '08328950'], 
    '08319000' : ['08317400', '08317200'],
    '08328950' : [],
    '08317200' : [],
    '08317400' : ['08313000'],
    '08313000' : ['08290000', '08279500'],
    '08287000' : [],
    '08279500' : [],
    '08290000' : ['08287000', '08289000'],
    '08289000' : [],
}

# Get list of all stations from upstream_stations
all_stations = list(upstream_stations.keys())
for station, upstream in upstream_stations.items():
    all_stations += upstream
all_stations = list(set(all_stations))

Notice that I also made a list containing all the station IDs. I use the pygeohydro package from the HyRiver suite of tools to retrieve the gauge station data (Chegini, Li, & Leung, 2021). I often cite this package, and have written about it in a past post (“Efficient hydroclimatic data accessing with HyRiver for Python”).

Using the list of all_stations, I use the following code to pull daily streamflow data for each site from 2015 to 2020 (or any other specified dates):

def get_usgs_gauge_data(stations, dates):
    """
    Get streamflow data from USGS gauge stations using NWIS.
    """
    nwis = NWIS()
    df = nwis.get_streamflow(stations, dates, mmd=False)
    
    # get rid of USGS- in columns
    df.columns = df.columns.str.replace('USGS-', '')
    return df

# Get USGS flows
flows = get_usgs_gauge_data(all_stations, ('2015-01-01', '2020-12-31'))

For the Sankey diagram, we need a single flow value for each station. In this case I calculate an average of the annual total flows at each station:

# Get annual mean flows
agg_flows = flows.resample('Y').sum().agg('mean')

Creating the Sankey figure

At its core, a Sankey diagram is a visualization of a weighted network (also referred to as a graph) defined by:

  • Nodes
  • Links (aka Edges)
  • Weights

In our case, the nodes are the USGS gauge stations, the links are the connections between upstream and downstream gauges, and the weights are the average volumes of water flowing from one gauge to the next.

Each link is defined by a source and target node and a value. This is where the upstream_stations dictionary comes in. In the code block below, I set up the nodes and links, looping through upstream_stations to define all of the source-target relationships:

## Define nodes and links
# Nodes are station IDs
nodes = all_stations
node_indices = {node: i for i, node in enumerate(nodes)}

# make links based on upstream-downstream relationships
links = {
    'source': [],
    'target': [],
    'value': [],
}

# loop through upstream_stations dict
for station, upstream_list in upstream_stations.items():
    for stn in upstream_list:
        if stn in agg_flows and station in agg_flows:
            links['source'].append(node_indices[stn])
            links['target'].append(node_indices[station])
            links['value'].append(agg_flows[stn])

Lastly, I define some node labels and assign colors to each node. In this case, I want to make the nodes black if they represent reservoir releases (gauges at reservoir outlets) or blue if they are simple gauge stations.

labels = {
    '08329918' : 'Rio Grande at Alameda', 
    '08319000' : 'San Felipe Gauge',
    '08328950' : 'Jemez Canyon Reservoir',
    '08317200' : 'Santa Fe River',
    '08317400' : 'Cochiti Reservoir',
    '08313000' : 'Rio Grande at Otowi Bridge',
    '08287000' : 'Abiquiu Reservoir',
    '08279500' : 'Rio Grande',
    '08290000' : 'Rio Chama',
    '08289000' : 'Rio Ojo Caliente',
}

# Create nodes labels and colors lists
node_labels = [labels[node] for node in nodes]
node_colors = ['black' if 'Reservoir' in label else 'dodgerblue' for label in node_labels]


Finally, the function to generate the figure:

def create_sankey_diagram(node_labels, links, node_colors, 
                          orientation='h',
                          size=(2000, 700)):
    """
    Create a Sankey diagram using Plotly.
    
    Parameters
    ----------
    node_labels : list
        List of node labels.
    links : dict
        Dictionary with keys 'source', 'target', and 'value'.
    node_colors : list
        List of node colors.
    orientation : str
        Orientation of the diagram, 'h' for horizontal and 'v' for vertical.
        
    Returns
    -------
    sankey_fig : plotly.graph_objects.Figure
        Plotly figure object.
    """
    sankey_fig = go.Figure(go.Sankey(
        orientation=orientation,
        node=dict(
            pad=70,
            thickness=45,
            line=dict(color='dodgerblue', width=0.5),
            label=node_labels,
            color=node_colors
        ),
        link=dict(
            source=links['source'],
            target=links['target'],
            value=links['value'],
            color='cornflowerblue'
        )
    ))
    
    sankey_fig.update_layout(
        title_text="Rio Grande Streamflow",
        font=dict(size=23),
        width=size[0],
        height=size[1]
    )
    return sankey_fig

There are some options for manipulating this figure script to better suit your needs. Specifically you may want to modify:

  • pad=70 : this is the padding (in pixels) between nodes
  • thickness=45 : this is the thickness of the node elements

With our pre-prepped data from above, we can use the function like so:

sankey_fig = create_sankey_diagram(node_labels, 
                                   links, 
                                   node_colors, 
                                   orientation='v', size=(1000, 1200))
sankey_fig

And here we have it:

I’d say it looks… okay. And admittedly this is the appearance after manipulating the node placement using the interactive interface.

It’s a bit squished vertically (which can be improved by making the figure much taller). However, my biggest issue is that the text is difficult to read.

Changing the orientation to horizontal (orientation='h') results in a slightly better looking figure, which makes sense, since Sankey diagrams are usually drawn horizontally. However, this loses the relationship to the actual north-south flow direction of the Rio Grande, so I don’t like it as much.

Conclusions

To answer the question posed by the title, “Sankey diagrams for USGS gauge data in python(?)”: Yes, sometimes. And sometimes something else.

Plotly complaints: While working on this post, I developed a few complaints with the plotly Sankey tools. Specifically:

  • It appears that the label text coloring cannot be modified. I don’t like the white edgecolor/blur effect, but could not get rid of this.
  • The font is very difficult to read… I had to make the text size very large for it to be reasonably legible.
  • You can only assign a single node thickness. I had wanted to make the reservoir nodes thicker and shrink the gauge station nodes. However, it seems this cannot be done.
  • The diagrams appear low-resolution and I don’t see a way to save a high res version.

Ultimately, the plotly tools are very restrictive in the design of the graphic. However, this is a tradeoff in order to get the benefit of interactive graphics.

Plotly praise: The plotly Sankey tools have some advantages, specifically:

  • The plots are interactive
  • Plots can be saved as HTML and embedded in websites

These advantages make the plotly tools good for anyone who might want to have a dynamic and maybe frequently updated dashboard on a site.
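For example, the interactive figure object created above can be exported as a standalone HTML file using plotly’s write_html method (the filename here is just an example):

# Save the interactive Sankey figure as a standalone HTML file
sankey_fig.write_html('rio_grande_sankey.html')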

On the other hand, if I wanted to prepare a publication-quality figure, where I had absolute control of the design elements, I’d likely turn to matplotlib. That way it could be saved as an SVG and further manipulated in a vector art program like Inkscape or Illustrator.

Thanks for reading!

References

Chegini, T., Li, H. Y., & Leung, L. R. (2021). HyRiver: Hydroclimate data retriever. Journal of Open Source Software, 6(66), 3175.


Figure Library Part 2 – Visualizations Supporting Synthetic Streamflow Diagnostics

Motivation

This is the second post in a series which was started by Dave last week and seeks to make a library of plotting functions for common visualization tasks. In this post, I share some plotting functions I have developed for synthetic streamflow diagnostic visualizations.

A thorough review of synthetic streamflow verification is outside the scope of this post, but in short we want to verify that the synthetic timeseries have statistical properties similar to the historic streamflow. Additionally, we want to see that the synthetic timeseries reflect the range of internal variability present in the basin by showing that a large synthetic ensemble envelopes the limited historic observations.

The following functions can be used to compare the following properties of historic and synthetic streamflow:

  • Flow magnitude
  • Non-exceedance probabilities (flow duration curves)
  • Autocorrelation
  • Spatial correlation across locations

Diagnostic plotting functions

  • plot_flow_ranges()
    • Range and median of flow values in historic and synthetic flow timeseries.
  • plot_fdc_ranges()
    • The range of annual flow duration curves (FDCs) for historic and synthetic streamflow timeseries.
  • plot_autocorrelation()
    • Comparison of historic and synthetic autocorrelation across a range of lag values.
  • plot_correlation()
    • Heatmap visualizations of Pearson correlation between different flow locations in the historic data and a single synthetic realization.

These functions are designed to accept historic (Qh) and synthetic (Qs) data with the following properties:

  • Qh : A pd.Series containing a single historic streamflow timeseries. The index of this data must be type pd.DatetimeIndex.
  • Qs : A pd.DataFrame containing many synthetic streamflow timeseries. The columns should each contain a single synthetic realization and the index must be type pd.DatetimeIndex. The column names do not matter.

Additionally, each function (excluding plot_fdc_ranges()) has a timestep keyword argument which changes the temporal scale of the flow being considered. Options are ‘daily’, ‘weekly’, or ‘monthly’, and the pandas.groupby() function is used to aggregate data accordingly to allow for comparison of statistics at different temporal scales. The default is ‘daily’.
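For example, weekly aggregation with groupby might look roughly like this (a sketch, assuming Qh is a pd.Series with a pd.DatetimeIndex as described above):

import pandas as pd

# Group daily flows by ISO week of the year, then take a weekly statistic
weekly_groups = Qh.groupby(pd.Index(Qh.index.isocalendar().week, dtype=int))
weekly_median = weekly_groups.median()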

Streamflow data

The historic data used in this example is taken from the Delaware River Basin. Specifically, streamflow data from the USGS gauge at Lordville, NY, on the upper portion of the Delaware River is used.

The synthetic streamflow was generated using the Kirsch-Nowak [1, 2] synthetic streamflow generator. For more detail on this generator, see Julie’s post here: Open Source Streamflow Generator Part 1.

I created 100 synthetic realizations of 30-year, daily streamflow timeseries at the Lordville gauge.

The historic and synthetic streamflow timeseries are assumed to start at the same date. Most of the plotting functions below do not explicitly consider the date associated with the timeseries, however comparisons between flow statistics across some range of time assume that the timeseries are aligned for the same dates.
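For reference, a minimal way to load data in this format might look like the following (the file names are hypothetical; any approach that yields a pd.Series Qh and a pd.DataFrame Qs with a pd.DatetimeIndex will work):

import pandas as pd

# Hypothetical file names; the first column is assumed to hold dates
Qh = pd.read_csv('historic_streamflow_lordville.csv', index_col=0, parse_dates=True).squeeze('columns')
Qs = pd.read_csv('synthetic_streamflow_lordville.csv', index_col=0, parse_dates=True)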


Flow magnitude range

This first plot simply shows the range (max, min) and median of flows for both historic and synthetic datasets.

# imports needed for the plotting functions in this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

def plot_flow_ranges(Qh, Qs, 
                     timestep = 'daily',
                     units = 'cms', y_scale = 'log',
                     savefig = False, fig_dir = '.',
                     figsize = (7,5), colors = ['black', 'orange'],
                     title_addon = ""):
    """Plots the range of flow for historic and synthetic streamflows for a specific timestep scale.
    
    Args:
        Qh (pd.Series): Historic daily streamflow timeseries. Index must be pd.DatetimeIndex. 
        Qs (pd.DataFrame): Synthetic daily streamflow timeseries realizations. Each column is a unique realization. Index must be pd.DatetimeIndex.
        timestep (str, optional): The timestep which data should be aggregated over. Defaults to 'daily'. Options are 'daily', 'weekly', or 'monthly'.
        units (str, optional): Streamflow units, for axis label. Defaults to 'cms'.
        y_scale (str, optional): Scale of the y-axis. Defaults to 'log'.
        savefig (bool, optional): Allows for png to be saved to fig_dir. Defaults to False.
        fig_dir (str, optional): Location of saved figure output. Defaults to '.' (working directory).
        figsize (tuple, optional): The figure size. Defaults to (7,5).
        colors (list, optional): List of two colors for historic and synthetic data respectively. Defaults to ['black', 'orange'].
        title_addon (str, optional): Text to be added to the end of the title. Defaults to "".
    """

    # Assert formatting matches expected
    assert(type(Qh.index) == pd.DatetimeIndex), 'Historic streamflow (Qh) should have pd.DatetimeIndex.'
    assert(type(Qs.index) == pd.DatetimeIndex), 'Synthetic streamflow (Qs) should have pd.DatetimeIndex.'

    # Handle conditional datetime formatting
    if timestep == 'daily':
        h_grouper = Qh.index.dayofyear
        s_grouper = Qs.index.dayofyear
        x_lab = 'Day of the Year (Jan-Dec)'
    elif timestep == 'monthly':
        h_grouper = Qh.index.month
        s_grouper = Qs.index.month
        x_lab = 'Month of the Year (Jan-Dec)'
    elif timestep == 'weekly':
        h_grouper = pd.Index(Qh.index.isocalendar().week, dtype = int)
        s_grouper = pd.Index(Qs.index.isocalendar().week, dtype = int)
        x_lab = 'Week of the Year (Jan-Dec)'
    else:
        print('Invalid timestep input. Options: "daily", "monthly", "weekly".')
        return

    # Find flow ranges
    s_max = Qs.groupby(s_grouper).max().max(axis=1)
    s_min = Qs.groupby(s_grouper).min().min(axis=1)
    s_median = Qs.groupby(s_grouper).median().median(axis=1)
    h_max = Qh.groupby(h_grouper).max()
    h_min = Qh.groupby(h_grouper).min()
    h_median = Qh.groupby(h_grouper).median()
  
    ## Plotting  
    fig, ax = plt.subplots(figsize = figsize, dpi=150)
    xs = h_max.index
    ax.fill_between(xs, s_min, s_max, color = colors[1], label = 'Synthetic Range', alpha = 0.5)
    ax.plot(xs, s_median, color = colors[1], label = 'Synthetic Median')
    ax.fill_between(xs, h_min, h_max, color = colors[0], label = 'Historic Range', alpha = 0.3)
    ax.plot(xs, h_median, color = colors[0], label = 'Historic Median')
    
    ax.set_yscale(y_scale)
    ax.set_ylabel(f'{timestep.capitalize()} Flow ({units})', fontsize=12)
    ax.set_xlabel(x_lab, fontsize=12)
    ax.legend(ncols = 2, fontsize = 10, bbox_to_anchor = (0, -.5, 1.0, 0.2), loc = 'upper center')    
    ax.grid(True, linestyle='--', linewidth=0.5)
    ax.set_title(f'{timestep.capitalize()} Streamflow Ranges\nHistoric & Synthetic Timeseries at One Location\n{title_addon}')
    plt.tight_layout()
    
    if savefig:
        plt.savefig(f'{fig_dir}/flow_ranges_{timestep}.png', dpi = 150)
    return

Assuming I have loaded a pd.DataFrame named Qs which contains synthetic timeseries in each column, and a single historic timeseries Qh, I can quickly plot daily (or weekly or monthly) flow ranges using:

# Usage
plot_flow_ranges(Qh, Qs, timestep='daily')

This figure provides a lot of useful information about our synthetic ensemble:

  • The synthetic ensemble envelopes the limited historic values, and includes more extreme high and low flows
  • The seasonal trends in the synthetics align with those in the historic record
  • The median of the synthetic ensemble aligns very closely with the median of the historic data.

Flow duration curve ranges

The plot_fdc_ranges() function will calculate an FDC for each year of the historic and synthetic data, and plot the ranges that these FDCs take. It will also plot the "total" FDCs calculated over the entirety of the historic and synthetic records.

def plot_fdc_ranges(Qh, Qs, 
                    units = 'cms', y_scale = 'log',
                    savefig = False, fig_dir = '.',
                    figsize = (5,5), colors = ['black', 'orange'],                   
                    title_addon = ""):
    """Plots the range and aggregate flow duration curves for historic and synthetic streamflows.
    
    Args:
        Qh (pd.Series): Historic daily streamflow timeseries. Index must be pd.DatetimeIndex. 
        Qs (pd.DataFrame): Synthetic daily streamflow timeseries realizations. Each column is a unique realization. Index must be pd.DatetimeIndex.
        units (str, optional): Streamflow units, for axis label. Defaults to 'cms'.
        y_scale (str, optional): Scale of the y-axis. Defaults to 'log'.
        savefig (bool, optional): Allows for png to be saved to fig_dir. Defaults to False.
        fig_dir (str, optional): Location of saved figure output. Defaults to '.' (working directory).
        figsize (tuple, optional): The figure size. Defaults to (5,5).
        colors (list, optional): List of two colors for historic and synthetic data respectively. Defaults to ['black', 'orange'].
        title_addon (str, optional): Text to be added to the end of the title. Defaults to "".
    """
    
    ## Assertions
    assert(type(Qs) == pd.DataFrame), 'Synthetic streamflow should be type pd.DataFrame.'
    assert(type(Qh.index) == pd.DatetimeIndex), 'Historic streamflow (Qh) should have pd.DatetimeIndex.'
    assert(type(Qs.index) == pd.DatetimeIndex), 'Synthetic streamflow (Qs) should have pd.DatetimeIndex.'

    
    # Calculate FDCs for total period and each realization
    nonexceedance = np.linspace(0.0001, 0.9999, 50)
    s_total_fdc = np.quantile(Qs.values.flatten(), nonexceedance)
    h_total_fdc = np.quantile(Qh.values.flatten(), nonexceedance) 
    
    s_fdc_max = np.zeros_like(nonexceedance)
    s_fdc_min = np.zeros_like(nonexceedance)
    h_fdc_max = np.zeros_like(nonexceedance)
    h_fdc_min = np.zeros_like(nonexceedance)

    annual_synthetics = Qs.groupby(Qs.index.year)
    annual_historic = Qh.groupby(Qh.index.year)

    for i, quant in enumerate(nonexceedance):
        s_fdc_max[i] = annual_synthetics.quantile(quant).max().max()
        s_fdc_min[i] = annual_synthetics.quantile(quant).min().min()
        h_fdc_max[i] = annual_historic.quantile(quant).max()
        h_fdc_min[i] = annual_historic.quantile(quant).min()
    
    ## Plotting
    fig, ax = plt.subplots(figsize=figsize, dpi=200)
    
    ax.fill_between(nonexceedance, s_fdc_min, s_fdc_max, color = colors[1], label = 'Synthetic Annual FDC Range', alpha = 0.5)
    ax.fill_between(nonexceedance, h_fdc_min, h_fdc_max, color = colors[0], label = 'Historic Annual FDC Range', alpha = 0.3)

    ax.plot(nonexceedance, s_total_fdc, color = colors[1], label = 'Synthetic Total FDC', alpha = 1)
    ax.plot(nonexceedance, h_total_fdc, color = colors[0], label = 'Historic Total FDC', alpha = 1, linewidth = 2)

    ax.grid(True, linestyle='--', linewidth=0.5)
    ax.set_yscale(y_scale)
    ax.set_ylabel(f'Flow ({units})')
    ax.set_xlabel('Non-Exceedance Probability')
    ax.legend(fontsize= 10)
    
    plt.title(f'Flow Duration Curve Ranges\nHistoric & Synthetic Streamflow\n{title_addon}')
    if savefig:
        plt.savefig(f'{fig_dir}/flow_duration_curves_{title_addon}.png', dpi=200)
    plt.show()
    return

## Usage
plot_fdc_ranges(Qh, Qs)

Similar to the flow range plot above, we see that the synthetic ensemble is nicely enveloping the historic data; there are synthetic instances where the extreme flows in a particular year are more extreme than historic observations. However, the FDC of the total synthetic ensemble (if all synthetics were combined) is almost exactly equal to the total historic FDC.

Autocorrelation

Hydrologic systems are known to have memory, or correlation between streamflow at different points in time. We want to make sure that our synthetic streamflow timeseries have similar memory or autocorrelation.

The plot_autocorrelation() function will calculate autocorrelation for the historic record and each synthetic realization over a range of different lag values. In addition to the required Qh and Qs arguments, this function requires lag_range, which should be a list or array of lag values to be considered.

def plot_autocorrelation(Qh, Qs, lag_range, 
                         timestep = 'daily',
                         savefig = False, fig_dir = '.',
                         figsize=(7, 5), colors = ['black', 'orange'],
                         alpha = 0.3):
    """Plot autocorrelation of historic and synthetic flow over some range of lags.

    Args:
        Qh (pd.Series): Historic daily streamflow timeseries. Index must be pd.DatetimeIndex. 
        Qs (pd.DataFrame): Synthetic daily streamflow timeseries realizations. Each column is a unique realization. Index must be pd.DatetimeIndex.
        lag_range (iterable): A list or range of lag values to be used for autocorrelation calculation.
        
        timestep (str, optional): The timestep which data should be aggregated over. Defaults to 'daily'. Options are 'daily', 'weekly', or 'monthly'.
        savefig (bool, optional): Allows for png to be saved to fig_dir. Defaults to False.
        fig_dir (str, optional): Location of saved figure output. Defaults to '.' (working directory).
        figsize (tuple, optional): The figure size. Defaults to (7,5).
        colors (list, optional): List of two colors for historic and synthetic data respectively. Defaults to ['black', 'orange'].
        alpha (float, optional): The opacity of synthetic data. Defaults to 0.3.
    """
    
    ## Assertions
    assert(type(Qs) == pd.DataFrame), 'Synthetic streamflow should be type pd.DataFrame.'
    assert(type(Qh.index) == pd.DatetimeIndex), 'Historic streamflow (Qh) should have pd.DatetimeIndex.'
    assert(type(Qs.index) == pd.DatetimeIndex), 'Synthetic streamflow (Qs) should have pd.DatetimeIndex.'

    if timestep == 'monthly':
        Qh = Qh.resample('MS').sum()
        Qs = Qs.resample('MS').sum()
        time_label = 'months'
    elif timestep == 'weekly':
        Qh = Qh.resample('W-SUN').sum()
        Qs = Qs.resample('W-SUN').sum()
        time_label = 'weeks'
    elif timestep == 'daily':
        time_label = 'days'
        
    # Calculate autocorrelations
    autocorr_h = np.zeros(len(lag_range))
    confidence_autocorr_h = np.zeros((2, len(lag_range)))
    for i, lag in enumerate(lag_range):
        h_corr = pearsonr(Qh.values[:-lag], Qh.values[lag:])
        autocorr_h[i] = h_corr[0]
        confidence_autocorr_h[0,i] = h_corr[0] - h_corr.confidence_interval().low
        confidence_autocorr_h[1,i] = h_corr.confidence_interval().high - h_corr[0]
    
    autocorr_s = np.zeros((Qs.shape[1], len(lag_range)))
    for i, realization in enumerate(Qs.columns):
        autocorr_s[i] = [pearsonr(Qs[realization].values[:-lag], Qs[realization].values[lag:])[0] for lag in lag_range]


    ## Plotting
    fig, ax = plt.subplots(figsize=figsize, dpi=200)

    # Plot autocorrelation for each synthetic timeseries
    for i, realization in enumerate(Qs.columns):
        if i == 0:
            ax.plot(lag_range, autocorr_s[i], alpha=1, color = colors[1], label=f'Synthetic Realization')
        else:
            ax.plot(lag_range, autocorr_s[i], color = colors[1], alpha=alpha, zorder = 1)
        ax.scatter(lag_range, autocorr_s[i], alpha=alpha, color = 'orange', zorder = 2)

    # Plot autocorrelation for the historic timeseries
    ax.plot(lag_range, autocorr_h, color=colors[0], linewidth=2, label='Historic', zorder = 3)
    ax.scatter(lag_range, autocorr_h, color=colors[0], zorder = 4)
    ax.errorbar(lag_range, autocorr_h, yerr=confidence_autocorr_h, 
                color='black', capsize=4, alpha=0.75, label ='Historic 95% CI')

    # Set labels and title
    ax.set_xlabel(f'Lag ({time_label})', fontsize=12)
    ax.set_ylabel('Autocorrelation (Pearson)', fontsize=12)
    ax.set_title('Autocorrelation Comparison\nHistoric Timeseries vs. Synthetic Ensemble', fontsize=14)

    ax.legend(fontsize=10)
    ax.grid(True, linestyle='--', linewidth=0.5)
    plt.subplots_adjust(left=0.12, right=0.95, bottom=0.12, top=0.92)

    if savefig:
        plt.savefig(f'{fig_dir}/autocorrelation_plot_{timestep}.png', dpi=200, bbox_inches='tight')
    plt.show()
    return

Let’s visualize the autocorrelation of daily streamflow for lags of up to 100 days.

## Usage
plot_autocorrelation(Qh, Qs, lag_range=range(1,100, 5), timestep='daily') 

Spatial Correlation

The plot_correlation() function generates a heatmap visualization of the Pearson correlation matrix for streamflow at different locations in both the historic and synthetic datasets.

The plot_correlation() function is different from the other three functions shown above, since it requires timeseries data for multiple different locations.

Rather than providing many different realizations for a single site, we need to provide a single synthetic realization across all sites of interest, with a timeseries for each site in each column. This new input is Qs_i, and it must have the same columns as Qh (both must contain a timeseries for each location).

def plot_correlation(Qh, Qs_i,
                        timestep = 'daily',
                        savefig = False, fig_dir = '.',
                        figsize = (8,4), color_map = 'BuPu',
                        cbar_on = True):
    """Plots the side-by-side heatmaps of flow correlation between sites for historic and synthetic multi-site data.
    
    Args:
        Qh (pd.DataFrame): Historic daily streamflow timeseries at many different locations. The columns of Qh and Qs_i must match, and index must be pd.DatetimeIndex. 
        Qs_i (pd.DataFrame): Synthetic daily streamflow timeseries for a single realization at many different locations. The columns of Qh and Qs_i must match, and index must be pd.DatetimeIndex.
        timestep (str, optional): The timestep which data should be aggregated over. Defaults to 'daily'. Options are 'daily', 'weekly', or 'monthly'.
        savefig (bool, optional): Allows for png to be saved to fig_dir. Defaults to False.
        fig_dir (str, optional): Location of saved figure output. Defaults to '.' (working directory).
        figsize (tuple, optional): The figure size. Defaults to (8,4).
        color_map (str, optional): The colormap used for the heatmaps. Defaults to 'BuPu'.
        cbar_on (bool, optional): Indicator of whether the colorbar should be shown or not. Defaults to True.
    """
    
    ## Assertions
    assert(type(Qh) == type(Qs_i) and (type(Qh) == pd.DataFrame)), 'Both inputs Qh and Qs_i should be pd.DataFrames.'
    assert(np.mean(Qh.columns == Qs_i.columns) == 1.0), 'Historic and synthetic data should have same columns.'
    
    if timestep == 'monthly':
        Qh = Qh.resample('MS').sum()
        Qs_i = Qs_i.resample('MS').sum()
    elif timestep == 'weekly':
        Qh = Qh.resample('W-SUN').sum()
        Qs_i = Qs_i.resample('W-SUN').sum()
    
    ## Plotting
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=figsize, dpi = 250)
    
    # Heatmap of historic correlation
    h_correlation = np.corrcoef(Qh.values.transpose())
    sns.heatmap(h_correlation, ax = ax1, 
                square = True, annot = False,
                xticklabels = 5, yticklabels = 5, 
                cmap = color_map, cbar = False, vmin = 0, vmax = 1)

    # Synthetic correlation
    s_correlation = np.corrcoef(Qs_i.values.transpose())
    sns.heatmap(s_correlation, ax = ax2,
                square = True, annot = False,
                xticklabels = 5, yticklabels = 5, 
                cmap = color_map, cbar = False, vmin = 0, vmax = 1)
    
    for axi in (ax1, ax2):
        axi.set_xticklabels([])
        axi.set_yticklabels([])
    ax1.set(xlabel = 'Site i', ylabel = 'Site j', title = 'Historic Streamflow')
    ax2.set(xlabel = 'Site i', ylabel = 'Site j', title = 'Synthetic Realization')
    title_string = f'Pearson correlation across different streamflow locations\n{timestep.capitalize()} timescale\n'
        
    # Adjust the layout to make room for the colorbar
    if cbar_on:
        plt.tight_layout(rect=[0, 0, 0.9, 0.85])
        cbar_ax = fig.add_axes([0.92, 0.1, 0.02, 0.6])  # [left, bottom, width, height]
        cbar = fig.colorbar(ax1.collections[0], cax=cbar_ax)
        cbar.set_label("Pearson Correlation")
    plt.suptitle(title_string, fontsize = 12)
    
    if savefig:
        plt.savefig(f'{fig_dir}/spatial_correlation_comparison_{timestep}.png', dpi = 250)
    plt.show()
    return

To use this function, I need to load a different pd.DataFrame containing synthetic timeseries at all of my study locations.

## Usage
single_synthetic_all_sites = pd.read_csv('synthetic_streamflow_single_realization_all_sites.csv', sep=',', index_col=0, parse_dates=True)

plot_correlation(Qh_all_sites, single_synthetic_all_sites, timestep='weekly')

Across the different streamflow sites, we see that the synthetic timeseries do a decent job at preserving spatial correlation, although correlation is decreased for sites which are not strongly correlated in the historic record.

Conclusions

I hope these functions save you some time in your own work and allow you to expand your synthetic diagnostic considerations.

References

[1] Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406.

[2] Nowak, K., Prairie, J., Rajagopalan, B., & Lall, U. (2010). A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow. Water Resources Research, 46(8).

Radial convergence diagram (aka chord diagram) for sensitivity analysis results and other inter-relationships between data

TLDR; the Python script for radial convergence plots can be found here.

You might have encountered this type of graph before; they’re usually used to present relationships between different entities/parameters/factors, and they typically look like this:

From https://datavizcatalogue.com/methods/non_ribbon_chord_diagram.html

In the context of our work, I have seen them used to present sensitivity analysis results, where we are interested not only in the individual significance of a model parameter, but also in the extent of its interaction with others. For example, in Butler et al. (2014) they were used to present First, Second, and Total order parameter sensitivities as produced by a Sobol’ Sensitivity Analysis.

From Butler et al. (2014)

I set out to write a Python script to replicate them. Calvin Whealton has written a similar script in R, and the same functionality also exists within Rhodium. I just wanted something with a bit more flexibility, so I wrote this script, which produces two types of these graphs: one with straight lines and one with curved lines (which are prettier IMO). The script takes dictionary items as inputs, either directly from SALib and Rhodium (if you are using it to display Sobol results), or by importing them (to display anything else). You’ll need one package to get this to run: NetworkX. It facilitates the organization of the nodes in a circle, and it’s generally a very stable and useful package to have.

Graph with straight lines
Graph with curved lines

I made these graphs to display the results of a Sobol analysis I performed on the model parameters of a system I am studying (a, b, c, d, h, K, m, sigmax, and sigmay). The node size indicates the first order index (S1) per parameter, the node border thickness indicates the total order index (ST) per parameter, and the thickness of the line between two nodes indicates the second order index (S2). The colors, thicknesses, and sizes can be easily changed to fit your needs. The script for these can be found here, and I will briefly discuss what it does below.

After loading the necessary packages (networkx, numpy, itertools, and matplotlib) and data, there are some parameters that can be adapted for the figure generation. First, we can define a significance value for the indices (here set to 0.01); to keep all values, just set it to 0. Then we have some stylistic variables that define the thicknesses and sizes of the lines and nodes. They can be changed to get the look of the graph to your liking.

# Set min index value, for the effects to be considered significant
index_significance_value = 0.01
node_size_min = 15 # Max and min node size
node_size_max = 30
border_size_min = 1 # Max and min node border thickness
border_size_max = 8
edge_width_min = 1 # Max and min edge thickness
edge_width_max = 10
edge_distance_min = 0.1 # Max and min distance of the edge from the center of the circle
edge_distance_max = 0.6 # Only applicable to the curved edges

The rest of the code should just do the work for you. It basically does the following (a minimal sketch of the circular node layout is shown after this list):

  • Define basic variables and functions that help draw circles and curves, get angles and distances between points
  • Set up graph with all parameters as nodes and draw all second order (S2) indices as lines (edges in the network) connecting the nodes. For every S2 index, we need a Source parameter, a Target parameter, and the Weight of the line, given by the S2 index itself. If you’re using this script for other data, different information can fit into the line thickness, or they could all be the same.
  • Draw nodes and lines in a circular shape and adjust node sizes, borders, and line thicknesses to show the relative importance/weight. Also, annotate text labels on each node and adjust their location accordingly. This produces the graph with the straight lines.
  • For the graph with the curved lines, define function that will generate the x and y coordinates for them, and then plot using matplotlib.
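Here is a minimal sketch of the node-placement idea, using made-up Sobol-style indices (not real results) and networkx’s circular_layout to arrange the parameter nodes on a circle; node sizes scale with S1 and edge widths with S2:

import matplotlib.pyplot as plt
import networkx as nx

# Made-up Sobol-style indices, for illustration only
params = ['a', 'b', 'c', 'd']
S1 = {'a': 0.30, 'b': 0.10, 'c': 0.05, 'd': 0.20}           # first order indices
S2 = {('a', 'b'): 0.08, ('b', 'd'): 0.03, ('a', 'c'): 0.02} # second order indices

index_significance_value = 0.01

# Parameters are nodes; significant S2 indices become weighted edges
G = nx.Graph()
G.add_nodes_from(params)
for (p1, p2), s2 in S2.items():
    if s2 > index_significance_value:
        G.add_edge(p1, p2, weight=s2)

pos = nx.circular_layout(G)  # arrange nodes evenly on a circle

node_sizes = [3000 * S1[p] for p in G.nodes]
edge_widths = [50 * G[u][v]['weight'] for u, v in G.edges]

nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color='lightgray', edgecolors='black')
nx.draw_networkx_edges(G, pos, width=edge_widths, edge_color='gray')
nx.draw_networkx_labels(G, pos)
plt.axis('off')
plt.show()

The full script linked above handles the remaining details: scaling node borders by ST, annotating the labels, and drawing the curved-line version.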

I would like to mention this script by Enrico Ubaldi, based on which I developed mine.

Interacting with Plotted Functions Using Jupyter Notebooks ipywidgets with matplotlib

When exploring the properties of an individual function or a system of equations, I often find myself quickly writing up code in a Jupyter Notebook (overview) to plot functions using matplotlib. While a previous blog post by Jazmin has gone over creating interactive plots with matplotlib, being able to actively change variables to quickly explore their sensitivity, or even to teach others, is crucial.

The primary reason I started playing with interactive widgets in Jupyter Notebook was to teach individuals the basic mechanics of a simple crop allocation model that utilizes Positive Mathematical Programming (PMP) (Howitt, 1995). Beyond showing the simple calibration steps via code, allowing students to interact with individual variables lets them not only explore the sensitivity of models but also find where such a model might break.

In this specific case, the marginal cost and revenues for a specific commodity (corn) versus the number of acres planted are graphed. A breaking case where the crop’s input data (the price per ton of corn) causes the model to produce unrealistic results (a negative marginal cost) is produced below.

Plotting Functions in matplotlib

One of the features that is often overlooked in matplotlib is that you can (indirectly) plot functions without much effort. The following commented code demonstrates plotting marginal cost and revenue curves on the same graph.

import matplotlib.pyplot as plt
from numpy import *

#fixed inputs from calibrated model
#alpha and gamma for quadratic formulation from Howitt, 1995
input_params = [743.0, 2.57]

#crop revenue per acre, known
crop_revenues = 1500

#Creating the x-axis domain, controls range of x inputs
t = linspace(0, 350)

#t is the independent variable, corn_MC is dependent variable
corn_MC = input_params[0]  + input_params[1] * t 

#note that you have to multiply t by 0 to populate entire line
corn_MR = crop_revenues + 0 * t

plt.figure(dpi=400) #set resolution of graph, can change. 

#label axes
plt.ylabel('Marginal Cost ($/acre)')
plt.xlabel('Acres of Corn Planted')

#plot as many lines as you'd like, expand them here
plt.plot(t, corn_MC, 'r', label='Marginal Cost')
plt.plot(t, corn_MR, 'b', label='Marginal Revenue')

#change size of ticks on axes
plt.tick_params(labelsize=8)

legend = plt.legend(loc=4, shadow=True)
plt.show()

 

The resulting graph is generated inline in a Jupyter Notebook. Note that the resolution and size of the graphic are adjusted using plt.figure(dpi=400); the default is much smaller and makes for hard-to-read graphs!

matplotlib_output.png

Interactive Inputs for Graphics in Jupyter Notebook

While graphing a function in a Jupyter Notebook using matplotlib is rather straightforward, interactive plots with functions are rarely utilized. Using ipywidgets allows us to easily create these tools for any type of situation, from vertical and horizontal slider bars to progress bars to dropdowns to boolean ‘click me’ buttons. You can find all of these features here. For the sake of simplicity, I have simply included three slider bars to demonstrate these features.
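As a minimal, self-contained illustration of the mechanics (using a hypothetical plotting function, not the PMP model), a slider can be wired to a matplotlib plot like this:

from ipywidgets import interactive
import matplotlib.pyplot as plt
import numpy as np

def plot_line(slope=1.0):
    # Re-drawn each time the slider value changes
    t = np.linspace(0, 10, 100)
    plt.plot(t, slope * t)
    plt.ylim(0, 50)
    plt.show()

# Slider for slope from 0.5 to 5.0, stepping by 0.5
interactive(plot_line, slope=(0.5, 5.0, 0.5))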

Notably, as an expansion of the example above, I have to run scipy.optimize.minimize multiple times whenever inputs are recalibrated (for questions on this end, I am more than happy to explain PMP if you contact me directly). To do this, I have to embed multiple functions into a larger function. However, I have not noticed any considerable slowdowns when running this much code.

To carry out the entirety of the example above, I have almost 300 lines of code to fully encapsulate a 3-crop allocation problem. For simplicity’s sake, I defined the function ‘pmp_example_sliders’ to contain all of the code required to calibrate the model and produce the graphic above (note that the input parameters are what are calculated) using scipy.optimize.minimize. Shadow prices are then derived using scipy.optimize.minimize and used to create the parameters necessary for the marginal cost curve shown in this example. Note that the full model can be found in the GitHub file.

from ipywidgets import *
import scipy.optimize
%matplotlib inline
from numpy import *
import matplotlib.pyplot as plt

def pmp_example_sliders(corn_price=250, corn_initial_acres=200, x_axis_scale=350):
    # ...calibration and plotting code omitted here; see the full notebook on GitHub...
    return plot
###Code for Running Calibration and Graphing Resulting Graph###
#create sliders for three variables to be changed
#have to have function with changing parameters (shown above)
#note that scale for corn_price is from 10 to 500, stepping by 10
slider = interactive(pmp_example_sliders, corn_price=(10, 500, 10),
                     corn_initial_acres=(0.01, 400, 10), x_axis_scale=(30, 650, 10))

#display the resulting slider
display(slider)

The following graphic results with the aforementioned sliders:

interact_1.PNG

We can easily change the input slider on the corn price from 250 to 410 to change to the following results:

interact_2.PNG

Because of this interactive feature, we not only see how the MC and MR curves interact, but we can also observe that the Alpha Parameter (shown below the graph) becomes negative. While not important for this specific blog post, it shows where the crop allocation model produces negative marginal costs for low allocations of corn. Thus, utilizing interactive graphics can help highlight dynamics and drawbacks/limitations of models, assisting in standalone tutorials.

If you would like to run this yourself, please feel free to download the Jupyter Notebooks from my GitHub. All code is written in Python 3. If you need assistance opening Jupyter Notebook, please refer to my blog post introducing Jupyter Notebook (link) or how to create a shortcut to launch Jupyter Notebook from your current working directory (link).

Tips for Creating Animated Graphics (GIFs)

Creating a silent movie to tell a narrative is a powerful yet challenging endeavor. Perhaps with inspiration to create an eye-catching graphic for your social media followers, or to better visualize intertemporal and interspatial changes, one can utilize GIFs to best meet these goals. However, with great (computational functionality and) power comes great responsibility. Creating a clean, minimalistic, standalone graphic that quickly conveys information is key, so knowing which tips to follow, along with a few linked tutorials, will give you a good starting point for creating your own effective animated visual.

I recently utilized a GIF in a presentation to show changes in perennial crop coverage in Kern County, CA alongside changes in drought intensity. While the GIF (shown at the end) wasn’t perfect, general tips for animation creation along with a quick critique to point out specific issues in the generated animation can help you when you plan out the creation of your own animation.

Considerations and Tips for Image Quality Control

Making an effective static graphic that easily conveys information without being overwhelming is tough, but creating a GIF can lead to disaster even faster than the blink of an eye. The most straightforward way to create your animation is to generate an incremental series of individual images that you will combine into a short movie.

The main considerations for creating an effective movie visual are ensuring it is readable, engaging, and informative. The first question you should ask is whether the story you want to tell can be fully and effectively explained with a static image. If it can be, stop and make that instead of wasting your time on an animation.

Assuming you’ve passed the first step, it’s time to storyboard your visual. Draw out what you’re wanting to show so you have a template to follow. (If you find that what you’re wanting to animate can be effectively told with a static image, please revisit the previous paragraph.)

Much like a paper, there should be an introduction or orienting feature that allows the consumer to easily understand the thesis of the graphic at a glance, in a standalone setting. Driving home the central theme of your graphic allows the visual to appear in public on its own, without fear of it lacking a caption to fully tell its story.

Here is an example of a well-made, informative, and engaging GIF that shows how to make effective tables:

 

It is clear and to the point, stepping the audience through a series of steps in an entertaining manner. It is well labeled, showing the author’s website, and is very concise in its messaging. Furthermore, it has been very memorable for me since I first saw it on Twitter years ago.

This website has plenty of examples for effective animations that you could use for inspiration. Additionally, for those Pinterest users out there, DataGIFs has plenty of good examples. On Reddit, plenty of beautiful (and ugly) animated and static graphics alike appear on /r/dataisbeautiful.  If you have a narrative to tell, there are likely similar animations that exist that you can use as a base to convey your tale.

With the big picture stuff out of the way, the next few points to consider can make or break your animation. While not every tip applies to every situation, covering them helps ensure your visual doesn’t have any glaring issues.

  • Looping and Timing
    • Make sure the timing of the gif isn’t too long or short.
      • 20 seconds is generally the maximum recommended length for audience engagement
      • Make sure there is enough time on each component of the GIF to be understood within two to three loops
    • Ensure the individual frames can be read in a timely fashion
    • Don’t have a jarring transition from the end of the GIF back to the start
      • Make sure the first and last frames match up in style
      • You can pull the user away from the original frame, but bring them back in since they’ll likely repeat the loop two to three times
      • Consider starting with a base slide
      • Individually add features so individuals are not overwhelmed
      • Much like a slide presentation, if the reader is overwhelmed with too much information from the start, they are likely not going to understand the progression of the story being conveyed
    • Slow is boring and may cause people to prematurely end their engagement with the animation
  • Content Layout
    • Consider the story you want to portray
    • Much like any figure, ensuring the animation can stand alone and be posted online while out of context without killing its meaning is essential
    • Select a “master frame” that is descriptive enough to be the “title slide” of the GIF—this is what will be seen when your slide deck is printed
  • Image Manipulation
    • Consistent sizing and coloring
      • Ensure labels are readable and are in the same location through all of the images. Having labels twitching between locations is distracting
      • Framing of the edges of objects should not drift. If making a transition between different sizing (e.g. zooming out), ensure it is incremental and not shocking
      • Customize components of a graphic to avoid bland experiences
  • Effective colors
    • Unless you’re artistically talented (i.e. not me), stick to generated color palettes. ColorBrewer is a good resource
    • Consider what your image might look like when printed out—try to make sure it converts well to grayscale. Note that generally the first image of a GIF’s sequence is what will be printed
    • Watch out for colorblind individuals. They are people too and are worth catering to (use this website to ensure your images are compatible)
    • The fewer, the better: selection of individual colors to make certain messages “pop” is worthwhile
  • Fonts and Labels
    • Give credit to yourself within the graphic
    • Make sure that all words can be read at a glance
    • If you plan to post the gif online, make sure individuals can get a decent idea of what’s going on before they blow it up
    • Assume your image will be posted on twitter: consider making a dummy (private) account to make sure everything important is readable from the preview along with the actual gif being readable from a regular smartphone
    • If distributing online, ensure the animation can be linked to a central website for citations/narratives/methods
  • Resolution and Size
    • Bigger is rarely better. Try to limit your size to around 1 MB max
    • Size your GIF based on your platform. The size is not nearly as important if you’re using it in a presentation you keep on a jump drive compared to if you’re going to upload it online.
    • Test your visual on multiple platforms (i.e. PowerPoint on another computer after moving the .ppt file) prior to use.
    • If posting something that will be shared on Twitter, ensure the narrative of the GIF can be followed when consumed on a mobile device (~80% of all Twitter interactions come from mobile devices)
  • Creating the GIF
    • Make sure your images aren’t too high of a resolution going into GIMP. It turns out that bigger isn’t always better
    • I have run into issues with Python-based libraries creating GIFs with artifacts.
      • imageio is commonly considered to be the best option when creating GIFs using Python due to its compatibility with PIL/PILLOW (source); a rough sketch is shown after this list
    • I am a firm believer in spending a bit of extra time to follow this tutorial using GIMP to create your GIF
      • Photoshop is the standard (tutorial), but it is not open source
    • This post does a good job of showing how to make animated GIFs in R
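As a rough sketch of the imageio approach (the frame file names and timing here are hypothetical, and this is not the workflow used for the GIF discussed below):

import imageio

# Combine a sequence of pre-rendered frames into a GIF
# (duration is the display time per frame, in seconds, in the imageio v2 API)
frames = [imageio.imread(f'frame_{i:02d}.png') for i in range(10)]
imageio.mimsave('animation.gif', frames, duration=0.65)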

Crop-Coverage Animation Critique

As mentioned above, I created a GIF to show changes in perennial crop coverage in Kern County, CA alongside changes in drought intensity. Each frame of the map was created in ArcGIS (which will eventually be migrated over to Python using Julie’s tutorial) and exported as a PNG, while the graph along the bottom is a screenshot from the U.S. Drought Monitor. These individual images were combined, a black rectangle overlaying the graph was added, and the result was saved as a single image for each given year. I then used GIMP to combine, order, and create a GIF (tutorial link).

2000_to_2017_export2.gif

The Good:

  1. The timing of the picture sequence (~650 ms between frames) makes for easy reading
  2. The box highlighting the given year’s drought conditions along the bottom graphic is effective at pointing out current conditions
  3. Coupling the moving box with a clear title indicating the current year shows a natural progression through time
  4. It is relatively straightforward to see where in California Kern County is located
    • The map includes a scale and north arrow to assist with orientation
    • The pop-out box for California isn’t very intrusive
  5. The title is large and easy to read, with the given year being clearly indicated
  6. The data for the drought monitor is labeled as required by the source

The Bad:

  1. The title shifts a few pixels in a couple of frames
  2. The graph does not stretch to match the size of the entire GIF, making the graphic feel unbalanced
  3. There is too much white space in random locations
    • The map itself could have been zoomed in a bit closer to leave out a lot of the white space in the eastern part of the county
    • Having too much white space on the right side of the map makes the graphic feel even further imbalanced
  4. There is not a website on the GIF to indicate the author/creator (me)

The Ugly:

  1. The choice in color schemes on the map jump out in a bad way
    • The districts drawn on the background map cannot be easily differentiated
    • Using red as a plotted color can be a bit jarring and is not colorblind friendly
  2. The labels in the legend hop around and are not consistent for three (3) of the years
    • The “master” or initial blank slide only has three categories while the rest of the slides have four
  3. The labeling for the graph is not clear (title, axes labels)
  4. The axis labels are blurry and small

Code showing how to make the drought monitor graphic (with updates) is posted at this GitHub link. Note that the images were compiled into a GIF using GIMP. The revised graphic is shown below.

kern_drought_upload_blog_post.gif

High definition images can be found on imgur here.

Map making in Matlab

Greetings,

This week’s post will cover the basics of generating maps in Matlab.  Julie’s recent post showed how to do some of this in Python, but Matlab is also widely used by the community.  You can get a lot done with Matlab, but in this post we’ll just cover a few of the basics.

We’ll start off by plotting a map of the continental United States, with the states.  We do this with three commands: usamap, shaperead, and geoshow.  usamap creates an empty map axes having the Lambert Projection covering the area of the US, or any state or collection of states.  shaperead reads shapefiles (duh) and returns a Matlab geographic data structure, composed of both geographic data and attributes.  This Matlab data structure then interfaces really well with various Matlab functions (duh).  Finally, geoshow plots geographic data, in our case on the map axes we defined.  Here’s some code putting it all together.

hold on
figure1 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
tightmap
hold off

Note that ‘usastatehi’ is a shapefile containing the US states (duh) that’s distributed with Matlab. The above code generates this figure:

conus_blank

Now, suppose we wanted to plot some data, say a precipitation forecast, on our CONUS map.  Let’s assume our forecast is made at many points (lat, long).  To interpolate between the points for plotting, we’ll use Matlab’s griddata function.  Once we’ve done this, we use Matlab’s contourm command.  This works exactly like the normal contour function, but the ‘m’ indicates it plots map data.

xi = min(x):0.5:max(x);
yi = min(y):0.5:max(y);
[XI, YI] = meshgrid(xi,yi);
ZI = griddata(x,y,V,XI,YI);

hold on
figure2 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])

contourm(YI,-1*XI,ZI)
tightmap
hold off

Here x, y, and V are vectors of longitude, latitude, and forecasted precipitation respectively.  This code generates the following figure:

conus_contour

Wow!  Louisiana is really getting hammered!  Let’s take a closer look.  We can do this by changing the argument to usamap to indicate we want to consider only Louisiana.  Note, usamap accepts US postal code abbreviations.

ax = usamap('LA');

Making that change results in this figure:

LA_contour

Neat!  We can also look at two states and add annotations.  Suppose, for no reason in particular, you’re interested in the location of Tufts University relative to Cornell.  We can make a map to look at this with the textm and scatterm functions.  As before, the ‘m’ indicates the functions  plot on a map axes.

hold on
figure4 = figure;
ax = usamap({'MA','NY'});

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
scatterm(42.4075,-71.1190,100,'k','filled')
textm(42.4075+0.2,-71.1190+0.2,'Tufts','FontSize',30)

scatterm(42.4491,-76.4842,100,'k','filled')
textm(42.4491+0.2,-76.4842+0.2,'Cornell','FontSize',30)
tightmap
hold off

This code generates the following figure.

Cornell_Tufts

Cool! Now back to forecasts.  NOAA distributes short term Quantitative Precipitation Forecasts (QPFs) for different durations every six hours.  You can download these forecasts in the form of shapefiles from a NOAA server.  Here’s an example of a 24-hour rainfall forecast made at 8:22 AM UTC on April 29.

fill_94qwbg

Wow, that’s a lot of rain!  Can we plot our own version of this map using Matlab?  You bet!  Again we’ll use usamap, shaperead, and geoshow.  The for loop, (0,1) scaling, and log transform are simply to make the color map more visually appealing for the post.  There’s probably a cleaner way to do this, but this got the job done!

figure5 = figure;
ax = usamap('conus');
S=shaperead('94q2912','UseGeoCoords',true);

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
p = colormap(jet);

N = max(size(S));
d = zeros(N,1);
for i = 1:N
 d(i) = log(S(i).QPF);
end

y=floor(((d-min(d))/range(d))*63)+1;
col = p(y,:);
for i = 1:N
 geoshow(S(i),'FaceColor',col(i,:),'FaceAlpha',0.5)%,'SymbolSpec', faceColors)
end

This code generates the following figure:

conus_shape

If you are not plotting in the US, Matlab also has a worldmap command.  This works exactly the same as usamap, but now for the world (duh).  Matlab is distributed with a shapefile ‘landareas.shp’ which contains all of the land areas in the world (duh).  Generating a global map is then trivial:

figure6 = figure;

worldmap('World')
land = shaperead('landareas.shp', 'UseGeoCoords', true);
geoshow(land, 'FaceColor', [0.15 0.5 0.15])

Which generates this figure.

globe

 

Matlab also comes with a number of other included shapefiles that might be of interest.  For instance, shapefiles detailing the locations of major world cities, lakes, and rivers.  We can plot those with the following code:

figure7 = figure;

worldmap('World')
land = shaperead('landareas.shp', 'UseGeoCoords', true);
geoshow(land, 'FaceColor', [0.15 0.5 0.15])
lakes = shaperead('worldlakes', 'UseGeoCoords', true);
geoshow(lakes, 'FaceColor', 'blue')
rivers = shaperead('worldrivers', 'UseGeoCoords', true);
geoshow(rivers, 'Color', 'blue')
cities = shaperead('worldcities', 'UseGeoCoords', true);
geoshow(cities, 'Marker', '.', 'Color', 'red')

Which generates the figure:

globe_river

But suppose we’re interested in one country or a group of countries.  worldmap works in the same way usamap does.  You can also plot continents, for instance Europe.

worldmap('Europe')

Europe.png

Those are the basics, but there are many other capabilities, including 3-D projections. I can cover this in a later post if there is interest.


That’s it for now!

9 basic skills for editing and creating vector graphics in Illustrator

This post intends to provide guidance for editing and creating vector graphics using Adobe Illustrator.  The goal is to learn some of the commonly used features to help you get started on your vectorized journey.  Whether it be a conceptual diagram, a logo, or cropping people out of your photo, these 9 features (and a fair amount of googling) will help you do the job.  Before we begin, it may be worthwhile to distinguish some of the main differences between a raster and a vector graphic.  A raster image is composed of a collection of squares or pixels, whereas vector graphics are based on mathematical formulas that define geometric forms (i.e. polygons, lines, curves, circles and rectangles), which makes them independent of resolution.

The three main advantages of using vector graphics over raster images are illustrated below:

1. Scalability: Vector graphics scale infinitely without losing any image quality. Raster images guess the colors of missing pixels when sizing up, whereas vector graphics simply use the original mathematical equation to create a consistent shape every time.

scalability-01.png

2. Editability: Vector files are not flattened, that is, the original shapes of an image exist separately on different layers; this provides flexibility in modifying different elements without impacting the entire image.

edibility.png

3. Reduced file size: A vector file only requires four data points to recreate a square, whereas a raster image needs to store many small squares.

reduced_file_size.png

 

9 key things to know when getting started with Adobe Illustrator:

1. Starting a project

You can start a new project simply by clicking File> New, and the following window will appear.  You can provide a number of specifications for your document before starting, but you can also customize your document at any stage by clicking File> Document setup (shortcut Alt+Ctrl+P).

pic1.PNG

2. Creating basic shapes

Lines & Arrows

Simply use the line segment tool (A) and remember to hold the Shift key to create perfectly straight lines.  Arrows can be added using the stroke window (Window> Stroke); the panel shown in (B) will appear, with a variety of arrow styles that you can select from and scale (C).  Finally, in the line segment tool you can specify the exact length of your line.

Slide1.PNG

Polygons 

Some shapes are already specified in Illustrator (e.g. rectangles, stars and circles (A)), but many others, such as triangles, need to be created with the polygon tool.  To draw a triangle, I specify the number of sides = 3, as shown in (B).

Slide1.PNG

Curvatures

To generate curvatures, you can use the pen tool (A).  Specify two points with the pen, hold the click on the second point, and a handle will appear; this handle allows you to shape the curve.  If you want to add more curvatures, draw another point (B) and drag the handle in the opposite direction of the curve.  You can then select the color (C) and the width (D) of your wave.

Slide1.PNG

3. Matching colors

If you need to match the color of an image (A) there are a couple of alternatives:

i) Using the “Eyedropper” tool (B).  Select the component of the image that you want to match, then select the Eyedropper tool and click on the desired color (C).

Slide1.PNG

ii) Using the color picker panel.  Select the image component with the desired color, then double click on the color picker (highlighted in red) and the following panel should appear.  You can see the exact color code and you can copy and paste it on the image that you wish to edit.

Slide1.PNG

4. Extracting exact position and dimensions

In the following example, I want the windows of my house to be perfectly aligned.  First, in (A), I click on one of the windows of my house and the control panel automatically provides its x and y coordinates, as well as its width and height.  Since I want to align both of the windows horizontally, I look up the y coordinate of the first window and copy it onto the y coordinate of the second window, as shown in (B).  The same procedure would apply if you want to copy the dimensions from one figure to another.

twohouses.png

5. Free-style drawing and editing 

The pencil tool (A) is one of my favorite tools in Illustrator, since it corrects my shaky strokes and allows me to paint freestyle.  Once I added color and filled the blob that I drew, it started to resemble a treetop (B).  You can edit your figure by right-clicking it.  A menu will appear enabling you to rotate, reflect, shear and scale, among other options.  I only wanted to tilt my tree, so I specified a mild rotation (C).

Slide1.PNG

Slide1.PNG

6. Cropping:

Cropping in Illustrator requires clipping masks, and I will show a few examples using Bone Bone, a fluffy celebrity cat.  Once a .png image is imported into Illustrator, it can be cropped using one of the following three methods:

Method 1.  Using the direct selection tool

Slide1.PNG

Method 2. Using shapes

Slide2.PNG

Method 3. Using the pen tool for a more detailed crop

Slide3.PNG

To revert to the original image, select Object> Clipping Mask> Release (or Alt+Ctrl+7).

7. Customizing the artboard size

If you want your image to be saved without extra white space (B), you can adapt the size of the canvas with the Artboard tool (Shift+O) (A).

Slide1.PNG

8. Using layers:

Layers can help you organize artwork, especially when working with multiple components in an image.  If the Layers panel is not already in your workspace, you can access it through Window> Layers or through the F7 shortcut, and a panel like the one below should appear.  You can rename a layer by double-clicking it and giving it a descriptive name.  Note that you can toggle the visibility of each layer on or off, and you can lock a layer if you want to protect it from further change, like the house layer in the example below.  Each layer is also color-coded; my current working layer is coded in red, so when I select an element in that layer it is highlighted in red.  Layers can have sub-layers to store individual shapes, like the house layer, which is composed of a collection of rectangles and a triangle.

layer.png

closelayer.png

9.  Saving vector and exporting raster

Adobe Illustrator naturally allows you to save images in many vector formats, but you can also export raster images such as .png, .jpeg, .bmp, etc.  To export raster images, go to File> Export and something like the blurry panel below should show up.  You can specify the resolution and the color of the background.  I usually like a transparent background, since it allows more flexibility when using your image in programs such as PowerPoint.

background.png

There are many more features available in Illustrator, but these are the ones that I find myself using quite often.  Also, you probably won’t have to generate images from scratch; there are many resources available online, and you can download free SVG images which you can later customize in Illustrator.  You can also complement this post by reading Jon Herman’s Scientific figures in Illustrator.

Plotting geographic data from geojson files using Python


Hi folks,

I’m writing today about plotting geojson files with Matplotlib’s Basemap.  In a previous post I laid out how to plot shapefiles using Basemap.

geojson is an open file format for representing geographical data, based on JavaScript Object Notation (JSON).  geojson geometries are composed of points, lines, and polygons, or ‘multi’ variants (e.g. multipolygons composed of several polygons), with accompanying properties.  The basic structure is one of names and values, where names are always strings and values may be strings, objects, arrays, or literals such as numbers and booleans.

The geojson structure we will be considering here is a collection of features, where each feature contains a geometry and properties.  Each geojson feature must contain properties and geometry.  Properties could be things like country name, country code, state, etc.  The geometry must contain a type (point, line, polygons, etc.) and coordinates (likely an array of lat-long). Below is an excerpt of a geojson file specifying Agro-Ecological Zones (AEZs) within the various GCAM regions.

{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [
{ "type": "Feature", "id": 1, "properties": { "ID": 1.000000, "GRIDCODE": 11913.000000, "CTRYCODE": 119.000000, "CTRYNAME": "Russian Fed", "AEZ": 13.000000, "GCAM_ID": "Russian Fed-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 99.5, 78.5 ], [ 98.33203125, 78.735787391662598 ], [ 98.85723876953125, 79.66796875 ], [ 99.901641845703125, 79.308036804199219 ], [ 99.5, 78.5 ] ] ] ] } },
{ "type": "Feature", "id": 2, "properties": { "ID": 2.000000, "GRIDCODE": 11913.000000, "CTRYCODE": 119.000000, "CTRYNAME": "Russian Fed", "AEZ": 13.000000, "GCAM_ID": "Russian Fed-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 104.5, 78.0 ], [ 104.0, 78.0 ], [ 99.5, 78.0 ], [ 99.5, 78.5 ], [ 100.2957763671875, 78.704218864440918 ], [ 102.13778686523437, 79.477890968322754 ], [ 104.83050537109375, 78.786871910095215 ], [ 104.5, 78.0 ] ] ] ] } },
{ "type": "Feature", "id": 3, "properties": { "ID": 3.000000, "GRIDCODE": 2713.000000, "CTRYCODE": 27.000000, "CTRYNAME": "Canada", "AEZ": 13.000000, "GCAM_ID": "Canada-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -99.5, 77.5 ], [ -100.50860595703125, 77.896504402160645 ], [ -101.76053619384766, 77.711499214172363 ], [ -104.68202209472656, 78.563323974609375 ], [ -105.71781158447266, 79.692866325378418 ], [ -99.067413330078125, 78.600395202636719 ], [ -99.5, 77.5 ] ] ] ] } }
]
}

Now that we have some understanding of the geojson structure, plotting the information therein should be as straightforward as traversing that structure and tying geometries to data.  We do the former using the geojson python package and the latter using pretty basic python manipulation.  To do the actual plotting, we’ll use PolygonPatch from the descartes library and recycle most of the code from my previous post.

We start by importing the necessary libraries and then open the geojson file.

import geojson
from descartes import PolygonPatch
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np

with open("aez-w-greenland.geojson") as json_file:
    json_data = geojson.load(json_file)

We then define a MatplotLib Figure, and generate a Basemap object as a ‘canvas’ to draw the geojson geometries on.

plt.clf()
ax = plt.figure(figsize=(10,10)).add_subplot(111)

m = Basemap(projection='robin', lon_0=0,resolution='c')
m.drawmapboundary(fill_color='white', zorder=-1)
m.drawparallels(np.arange(-90.,91.,30.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawmeridians(np.arange(0., 360., 60.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawcoastlines(color='0.6', linewidth=1)

Next, we iterate over the nested features in this file and pull out the coordinate list defining each feature’s geometry.  We also pull out the feature’s name and AEZ, which I can tie to GCAM data.

for i in range(2799):
    # coordinate list defining this feature's geometry
    coordlist = json_data.features[i]['geometry']['coordinates'][0]
    if i < 2796:
        # feature attributes that can be tied to GCAM data
        name = json_data.features[i]['properties']['CTRYNAME']
        aez = json_data.features[i]['properties']['AEZ']

    # convert each lon/lat pair to projected map coordinates
    for j in range(len(coordlist)):
        for k in range(len(coordlist[j])):
            coordlist[j][k][0],coordlist[j][k][1]=m(coordlist[j][k][0],coordlist[j][k][1])

    # add the feature's polygon to the figure as a patch
    poly = {"type":"Polygon","coordinates":coordlist}
    ax.add_patch(PolygonPatch(poly, fc=[0,0.5,0], ec=[0,0.3,0], zorder=0.2 ))

ax.axis('scaled')
plt.draw()
plt.show()

The call to m() inside the nested loops converts the coordinate list from lat/long to projected map coordinates (meters).  Depending on what projection you’re working in and what units your inputs are in, you may or may not need to do this step.

The final lines add each polygon to the figure, with a green face color and a dark green border.  This generates the figure:

for_blog

To get a bit more fancy, we could tie the data to a colormap and then link that to the facecolor of the polygons.  For instance, the following figure shows the improvement in maize yields over the next century in the shared socio-economic pathway 1 (SSP 1), relative to a reference scenario (SSP 2).

maize_ssp1
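As a rough sketch of that colormap idea (not the code used to make the figure above), one could normalize a value per feature and pass it through a Matplotlib colormap to get each patch’s face color.  Here value_by_feature is a hypothetical dictionary keyed by the GCAM_ID property:

import matplotlib.colors as mcolors
from matplotlib import cm

# hypothetical data: one value per feature, keyed by its GCAM_ID
value_by_feature = {"Russian Fed-13": 0.8, "Canada-13": 0.3}

# normalize the values and choose a colormap
norm = mcolors.Normalize(vmin=min(value_by_feature.values()),
                         vmax=max(value_by_feature.values()))
cmap = cm.YlGn

for feature in json_data.features:
    gcam_id = feature['properties']['GCAM_ID']
    if gcam_id not in value_by_feature:
        continue
    facecolor = cmap(norm(value_by_feature[gcam_id]))  # RGBA tuple

    # project the feature's coordinates to map units, as in the loop above
    coordlist = feature['geometry']['coordinates'][0]
    for ring in coordlist:
        for point in ring:
            point[0], point[1] = m(point[0], point[1])

    poly = {"type": "Polygon", "coordinates": coordlist}
    ax.add_patch(PolygonPatch(poly, fc=facecolor, ec=[0, 0.3, 0], zorder=0.2))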

Saving d3.parcoords to SVG

d3.parcoords is a great library for making interactive parallel coordinate plots.  A major issue, however, is that it is a pain to get the resulting plots into a format suitable for publication.  In this blog post, I will show how we can turn a d3.parcoords plot into an SVG document, which we can save locally.  SVG is an XML-based format for vector graphics, so it is ideal for publications.

This blog post is an example of how to get the SVG data. It is however far from complete, and there might be better ways of achieving some of the steps. Any comments or suggestions on how to improve the code are welcome. I wrote this while learning javascript, without any prior experience with respect to web technology.

First, how is a d3.parcoords plot structured?  It is composed of five elements: 4 HTML5 canvas layers and a single SVG layer.  The SVG layer contains the axes for each dimension.  The 4 canvas layers are marks, highlight, brushed, and foreground.  I am not sure what the function of the first two is, but brushed contains the lines that are selected through brushing, while foreground contains all the remaining lines.

In order to export a d3.parcoords figure as pure svg, we need to somehow replace the HTML canvas with something that has the same interface, but generates SVG instead. Luckily there are several javascript libraries that do this. See http://stackoverflow.com/questions/8571294/method-to-convert-html5-canvas-to-svg for an overview. In this example, I am using http://gliffy.github.io/canvas2svg/ , which is a recent library that still appears to be maintained.

The basic idea is the following:

  • replace the normal HTML5 canvas.context for each layer with the one from canvas2svg, and render the plot
  • extract the axis svg
  • extract the SVG from the 5 canvas layers, and combine the 5 layers into a single svg document
  • save it
  • reset the canvas

To make this work, we are depending on several javascript libraries in addition to the default dependencies of d3.parcoords.  These are canvas2svg (which renders canvas drawing calls to SVG), jQuery (used here for XML parsing), vkbeautify (for pretty-printing the SVG string), and FileSaver.js (for saving the result locally).

Replace canvas.context

In order to replace the canvas.context for each layer, we iterate over the names of the layers. d3.parcoords saves the contexts in an internal object, indexed by name. We keep track of the old context for each layer, because this makes restoring a lot easier at the end. We instantiate the C2S context (the class provided by canvas2svg), by specifying the width and height of the plot. In this case, I have hardcoded them for simplicity, but it would be better to extract them from the HTML or CSS.

const layerNames = ["marks", "highlight", "brushed", "foreground"];

const oldLayers = {};
let oldLayerContext;
let newLayerContext;
let layerName;
for (let i=0; i<layerNames.length; i++){
    layerName = layerNames[i];

    oldLayerContext = pc0.ctx[layerName]; //pc0 is the d3.parcoords plot
    newLayerContext = new C2S(720, 200); 

    oldLayers[layerName] = oldLayerContext;
    pc0.ctx[layerName] = newLayerContext;
}
pc0.render();

Extract the Axis svg

Getting the axis svg is straightforward.  We select the svg element in the DOM, serialise it to a string, and then use jQuery to create a nice XML document out of the string.

const svgAxis = new XMLSerializer().serializeToString(d3.select('svg').node());
const axisXmlDocument = $.parseXML(svgAxis);

The only problem with this approach is that the SVG does not contain the style information, which is provided in the CSS. So, we need to inline this information. To do so, I created two helper functions. The first helper function allows us to set an attribute on elements that have the same tag. The second does the same, but based on class name.

// helper function for saving svg
function setAttributeByTag(xmlDocument, tagName, attribute, value){
    const paths = xmlDocument.getElementsByTagName(tagName);
    for (let i = 0; i < paths.length; i++) {
        paths[i].setAttribute(attribute, value);
    }
}

// helper function for saving svg
function setAttributeByClass(xmlDocument, className, attribute, value){
    const paths = xmlDocument.getElementsByClassName(className);
    for (let i = 0; i < paths.length; i++) {
        paths[i].setAttribute(attribute, value);
    }
}

We can now use these helper functions to inline some CSS information.  Note that this is an incomplete subset of all the CSS information used by d3.parcoords.  A future extension would be to extract all the d3.parcoords style information from the CSS and inline it.

setAttributeByTag(axisXmlDocument, "axis", "fill", "none");
setAttributeByTag(axisXmlDocument, "path", "stroke", "#222");
setAttributeByTag(axisXmlDocument, "line", "stroke", "#222");
setAttributeByClass(axisXmlDocument, "background", "fill", "none");

Extract the SVG from each layer

We now have an XML document to which we can add the SVG data of each of our layers.  In order to keep track of the structure of the SVG, I have chosen to first create a new group node, and subsequently add each layer to this new group as a child.  To make sure that this group is positioned correctly, I clone the main group node of the axis svg, remove its children, and insert this new node at the top of the XML document.

const oldNode = axisXmlDocument.getElementsByTagName('g')[0];
const newNode = oldNode.cloneNode(true);
while (newNode.hasChildNodes()){
    newNode.removeChild(newNode.lastChild);
}
axisXmlDocument.documentElement.insertBefore(newNode, oldNode);

There is some trickery involved in what I am doing here. SVG groups are rendered on top of each other, in the order in which they appear in the XML document. It appears that one can provide a z-order as well according to the SVG 2.0 specification, but I have not pursued that direction here. By adding the newly created node to the top, I ensure that the axis information is at the end of the XML document, and thus always on top of all the other layers. For the same reason, I have also deliberately sorted the canvas layer names.

Now that we have a new node, we can iterate over our canvas layers and extract the svg data from them.  Next, we parse the xml string to turn it into an XML document.  We have to overwrite a transform attribute that is used when working on a retina screen; this matters for an HTML canvas but not for SVG.  For convenience, I also add the layer name as a class attribute, so in our SVG we can easily spot each of the canvas layers.  The XML document for a given layer contains two main nodes.  The first node contains the defs tag, which we don’t need.  The second node contains the actual SVG data, which is what we do need.

let svgLines;
let xmlDocument;
for (let i=0; i<layerNames.length; i++){
    // get svg for layer
    layerName = layerNames[i];
    svgLines = pc0.ctx[layerName].getSerializedSvg(true);
    xmlDocument = $.parseXML(svgLines);

    // scale is set to 2,2 on retina screens, this is relevant for canvas
    // not for svg, so we explicitly overwrite it
    xmlDocument.getElementsByTagName("g")[0].setAttribute("transform", "scale(1,1)");

    // for convenience add the name of the layer to the group as class
    xmlDocument.getElementsByTagName("g")[0].setAttribute("class", layerName);

    // add the group to the node
    // each layers has 2 nodes, a defs node and the actual svg
    // we can safely ignore the defs node
    newNode.appendChild(xmlDocument.documentElement.childNodes[1]);
}

Save it

We have all our SVG data in the xml document. All that is left is to turn this back into a string, format the string properly, turn it into a blob, and save it. We can achieve this in three lines.

// turn merged xml document into string
// we also beautify the string, but this is optional
const merged = vkbeautify.xml(new XMLSerializer().serializeToString(axisXmlDocument.documentElement));

// turn the string into a blob and use FileSaver.js to enable saving it
const blob = new Blob([merged], {type:"image/svg+xml"});
saveAs(blob, "parcoords.svg");

Reset context

We have now saved our SVG file locally, but we still have to put back our old canvas contexts.  We have stored these, so we can simply loop over the layer names and put back the old context.  In principle, this last step might not be necessary, but I work on machines with a retina screen and ran into scaling issues when trying to use C2S contexts outside of the save function.

// we are done extracting the SVG information so
// put the original canvas contexts back
for (let i=0; i<layerNames.length; i++){
    pc0.ctx[layerNames[i]] = oldLayers[layerNames[i]]
}
pc0.render();

Putting it all together

I have a repo on github with the full code including dependencies etc: https://github.com/quaquel/parcoords .

The code shown in this blog is not complete.  For example, brushed plots will not display nicely and require some post-processing of the SVG.

For those that are more familiar with D3.parcoords, note how the coloring of the lines is dependent on which axis you select. I have connected the color to a click event on the axis to make this possible.

Survival Function Plots in R


A survival function (aka survivor function or reliability function) is a function often used in risk management for visualizing system failure points. For example, it can be used to show the frequency of a coastal defense structure failure (such as a breach in a levee) in a future state of the world.

The function itself is quite simple. For a distribution of events, the survival function (SF) is 1-CDF where CDF is the cumulative distribution function. If you’re deriving the distribution empirically, you can substitute the CDF with the cumulative frequency. It is often plotted on a semi-log scale which makes tail-area analysis easier.
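For readers working in Python rather than R, here is a minimal sketch of the same construction (this is my own illustration, not part of the R code below): the empirical SF is one minus the cumulative frequency, plotted with a logarithmic y-axis.

import numpy as np
import matplotlib.pyplot as plt

# example data (hypothetical)
x = np.random.normal(10, 2, 10000)

# empirical survival function: after sorting, sf[i] is the fraction of
# observations greater than or equal to x_sorted[i] (from 1 down to 1/n)
x_sorted = np.sort(x)
sf = 1.0 - np.arange(len(x_sorted)) / len(x_sorted)

plt.semilogy(x_sorted, sf)  # semi-log scale makes tail-area analysis easier
plt.xlabel("x")
plt.ylabel("SF [1 - Cum. Freq.]")
plt.show()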

I’ve written some R code that creates a primitive Survival Function plot from a vector of data.  Below is the function (Note: You can find the code and an example of its usage on bitbucket https://bitbucket.org/ggg121/r_survival_function.git)

plot.sf <- function(x, xlab=deparse(substitute(x)), left.tail=F,
  ylab=ifelse(left.tail, "SF [Cum. Freq.]", "SF  [1 - Cum. Freq.]"),
  make.plot=T, ...)
{
  num.x <- length(x)
  num.ytics <- floor(log10(num.x))
  sf <- seq(1,1/num.x,by=-1/num.x)
  
  if(left.tail){
    order.x <- order(x, decreasing=T)
    order.sf <- sf[order(order.x)]
    
  }  else {
    order.x <- order(x)
    order.sf <- sf[order(order.x)]
  }
  
  if(make.plot) {
    plot(x[order.x], sf, log="y", xlab=xlab, ylab=ylab, yaxt="n", ...)
    axis(2, at=10^(-num.ytics:0), 
         label=parse(text=paste("10^", -num.ytics:0, sep="")), las=1)
  }
  invisible(order.sf)
}

Download and source the code at the start of your R script and you’re good to go.  The function, by default, creates a plot in the current plotting device and invisibly returns the survival function values corresponding to the vector of data provided.  The parameter left.tail sets the focus on the left tail of the distribution (essentially plotting the CDF on a semi-log scale).  By default, the function puts the focus on the right tail (left.tail = FALSE).  The make.plot parameter allows you to toggle plotting of the survival function (the default is on, or make.plot = TRUE).  This is useful when you simply need the survival function values for further calculations or custom plots.  Additional parameters are passed to the plot() function.  Below is an example (which is also available in the repository).

# Source the function
source("plot_sf.r")

# Set the seed
set.seed(1234)

# Generate some data to use
my.norm <- rnorm(10000, 10, 2)
my.unif <- runif(10000)
my.weib <- rweibull(10000, 20, 5)
my.lnorm <- rlnorm(10000, 1, 0.5)


# Make the plots ----------------------
par(mfrow=c(2,2), mar=c(5,4,1,1)+0.1)

# Default plot settings
plot.sf(my.norm)

# Function wraps the standard "plot" function, so you can pass
# the standard "plot" parameters to the function
plot.sf(my.unif, type="l", lwd=2, col="blue", bty="l",
        ylab="Survival", xlab="Uniform Distribution")

# If the parameter "left.tail" is true, the plot turns into 
# a cumulative frequency plot (kind of like a CDF) that's plotted
# on a log scale.  This is good for when your data exhibits a left or
# negative skew.
plot.sf(my.weib, type="l", left.tail=T, xlab="Left-tailed Weibull Dist.")

# The function invisibly returns the survival function value.
lnorm.sf <- plot.sf(my.lnorm, type="l")
points(my.lnorm, lnorm.sf, col="red")
legend("topright", bty="n", 
       legend=c("Function Call", "Using returned values"), 
       lty=c(1,NA), pch=c(NA,1), col=c("black", "red") )

# The 'make.plot' parameter toggles plotting.
# Useful if you just want the survival function values.
norm.sf <- plot.sf(my.norm, make.plot=F)

And here’s the resulting figure from this example:

survival_function_plot_example

Now you can easily show the tail-area frequency of events.  For example, below is a survival function plot of a normal distribution:

survival_plot_norm_didactic

For this example, we can imagine this as a distribution of flood heights (x-axis would be flood height – note that a real distribution of flood heights would likely look drastically different from a normal distribution). With this visualization, we can easily depict the “1 in 10” or the “1 in 1,000” flood height by following the appropriate survival function value over to the corresponding flood height on the plot. Alternatively, you can determine the return period of a given flood height by following the flood height up to the plot and reading off the survival function value. Comparing multiple distributions together on a single plot (think deep uncertainty) can produce interesting decision-relevant discussion about changes in return periods for a given event or the range of possible events for a given return period.
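As a quick numerical illustration of that reading (again in Python, with hypothetical flood-height data): the 1-in-T event is the level whose survival function value is 1/T, i.e. the (1 − 1/T) quantile of the data.

import numpy as np

# hypothetical flood-height sample
flood_heights = np.random.normal(10, 2, 10000)

# the "1 in 100" flood height: SF = 1/100, i.e. the 0.99 quantile
T = 100
height_1_in_T = np.quantile(flood_heights, 1 - 1 / T)
print(f"Approximate 1-in-{T} flood height: {height_1_in_T:.2f}")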

I hope this post is useful. Survival function plots are incredibly versatile and informative…and I’ve only started to scratch the surface!