Plotting change on maps

or how to replicate the New York Times presidential election shift map

This week’s blogpost is a visualization demo replicating a popular map from last year. The map below shows the shift in voter margin between the 2016 and 2020 Presidential Elections by the two major political parties in the United States. The direction and color of the arrows indicates the party and the length of the arrow indicates the shift. This type of figure can be useful in visualizing many types of spatially distributed changes (e.g. population change in a city, change in GDP per capita, losses and gains). This blogpost shows how to replicate it in Python using commonly used packages.

Screengrab of the original graphic from the NYT website. Original can be found here: https://www.nytimes.com/interactive/2020/11/03/us/elections/results-president.html

Even though the creators of the original provide their 2020 data, their 2016 data is not available so the data I’ll be using came from the MIT Election Data and Science Lab and can be downloaded here: https://doi.org/10.7910/DVN/VOQCHQ. All the code and data to replicate my figure can be found in this repository: https://github.com/antonia-had/election_data_shift

The main packages we’ll be using for this are cartopy and matplotlib to create the map and annotate elements on it, pandas for some simple data analysis and haversine to convert distances on the map (which you might not need if you’re applying the code to a small spatial scale).

First thing we do is load our packages and data. counties.csv contains the latitude and longitude for every country we’ll be plotting. countypres_2000-2020.csv contains our downloaded election data. As you can see in the code comments, I had to clean out some of the datapoints due to inconsistencies or errors. I’ll also only be plotting the contiguous US to simplify the exercise, but you can definitely include code to also plot Alaska and Hawaii in the same figure.

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pandas as pd
import cartopy.io.shapereader as shpreader
from haversine import inverse_haversine, Direction

# Read in county position data
pos_data = pd.read_csv('./data/counties.csv', delimiter=',', index_col=0)

# Read in county election data
# Data from https://doi.org/10.7910/DVN/VOQCHQ
# Data points without county FIPS code removed
all_election_data = pd.read_csv('./data/countypres_2000-2020.csv')
# Filter data to only keep years 2016 and 2020
# Dataset reports issues with Alaska data so filter those out too
# Missing data for 2020 for some counties
# County with FIPS code 46113 was assigned a new FIPS code (46102) which is changed in the downloaded data
mask = (all_election_data['year'] >= 2016) & \
       (all_election_data['state'] != 'ALASKA') &\
       (all_election_data['state'] != 'HAWAII') & \
       (all_election_data['county_fips'] != 11001) & \
       (all_election_data['county_fips'] != 51515) & \
       (all_election_data['county_fips'] != 36000)
election_data = all_election_data[mask]

Next we calculate the percentage of votes each party gained at each election and compare the results between the two elections to calculate their shift. A simplifying assumption here is that we’re only focussing on the top two parties (but you can do more with different color arrows for example). We’re also copying the latitude and longitude of each county so everything is in one dataframe.

# Calculate vote percentage per party
election_data['percentagevote'] = election_data['candidatevotes']/election_data['totalvotes'] * 100

# Create new dataframe to store county change results
shift = election_data[['state', 'county_name', 'county_fips']].copy()
# Drop duplicate rows (original dataframe was both 2016 and 2020)
shift = shift.drop_duplicates(['county_fips'])

# Create columns to store change for every party
shift['DEMOCRAT'] = 0.0
shift['REPUBLICAN'] = 0.0

#Create columns for latitude and longitude so everything is in the same dataframe
shift['lat'] = 0.0
shift['lon'] = 0.0

# Iterate through every county and estimate difference in vote share for two major parties
for index, row in shift.iterrows():
    county = row['county_fips']
    for party in ['DEMOCRAT', 'REPUBLICAN']:
        previous_result = election_data.loc[(election_data['year'] == 2016) &
                                            (election_data['county_fips'] == county) &
                                            (election_data['party'] == party)]['percentagevote'].values[0]
        new_result = election_data.loc[(election_data['year'] == 2020) &
                                       (election_data['county_fips'] == county) &
                                       (election_data['party'] == party)]['percentagevote'].values[0]
        # If any of the two results is nan assign zero change
        if pd.isna(new_result) or pd.isna(previous_result):
            shift.at[index, party] = 0
        else:
            shift.at[index, party] = new_result - previous_result
    # Combine lat and long values also so it's all in one dataframe
    shift.at[index, 'lat'] = pos_data.at[county, 'lat']
    shift.at[index, 'lon'] = pos_data.at[county, 'lon']

To create our map we do the following.

Set up matplotlib figure with the map extent of the contiguous United States and use cartopy geometries to add the shapes of all states.

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.LambertConformal(), frameon=False)
ax.set_extent([-120, -74, 24, 50], ccrs.PlateCarree())
# Add states shape
shapename = 'admin_1_states_provinces_lakes'
states_shp = shpreader.natural_earth(resolution='110m',
                                     category='cultural', name=shapename)
ax.add_geometries(shpreader.Reader(states_shp).geometries(), ccrs.PlateCarree(),
                  facecolor='#e5e5e5', edgecolor='white', zorder=0)

We then need to determine how the shift should be plotted in each county. A simplifying assumption here is that we’re showing the largest positive shift (i.e., if both parties lost votes we’re only showing a small grey point). There’s several ways to draw an arrow at each point, depending on what you’d like to show and the complexity you’re comfortable with. The way I am showing here is exploiting the matplotlib annotate function, typically used to annotate a figure with text and arrows.

The way I’m going about this is a little mischievous but works: I’m only using the arrow component of it with a blank text annotation and identify a point where each arrow should be pointing to by using each county’s lat and long and the estimated shift. If this was a simple matplotlib figure using cartesian coordinates, calculating the end point would be simple trigonometry. Since latitude and longitude are not on a cartesian plane, we need to convert them using the haversine formula (or its inverse). It’s fairly easy to implement yourself but since there already exceeds a handy python package for it, I’m using that instead. The transform function I am using up top is necessary for matplotlib to know how to transform the points from the annotation function (typically not necessary to do if using, say, ax.scatter()), some explanation of why that is can be found here. The colors and all other customization is done so the figure looks as close as possible to the original.

transform = ccrs.PlateCarree()._as_mpl_transform(ax)
for index, row in shift.iterrows():
    # Determine arrow color
    dem_shift = shift.at[index, 'DEMOCRAT']
    rep_shift = shift.at[index, 'REPUBLICAN']
    # Check if both lost votes, then set arrow to grey
    if dem_shift<0 and rep_shift<0:
        arrow_color = 'grey'
        ax.scatter(shift.at[index, 'lon'], shift.at[index, 'lat'],
                   color=arrow_color, transform=ccrs.PlateCarree(),
                   s=0.5)
    # If at least one of them gained votes
    else:
        if dem_shift >= rep_shift:
            arrow_color = '#1460a8'
            direction = Direction.NORTHWEST
            change = dem_shift
        else:
            arrow_color = '#bb1d2a'
            direction = Direction.NORTHEAST
            change = rep_shift
        end_location = inverse_haversine((shift.at[index, 'lat'], shift.at[index, 'lon']), change*25, direction)[::-1]
        ax.annotate(" ", xytext=(shift.at[index, 'lon'], shift.at[index, 'lat']), xy=end_location,
                    arrowprops=dict(facecolor=arrow_color, edgecolor=arrow_color,
                                    width=0.2, headwidth=3, headlength=5),
                    xycoords=transform, zorder=1)
plt.tight_layout()
plt.savefig('electionshiftmap.png', dpi=300)

The resulting figure looks like this, which I am calling pretty close, considering the dataset differences. Tinkering with colors, widths, lengths and transforms can get you a different look if you’re after that.

Pam agrees:

MORDM Basics VI: Processing the output and reevaluating for robustness

In the previous post, we conducted a basic WaterPaths tutorial in which we ran a simulation-optimization of the North Carolina Research Triangle test case (Trindade et al, 2019); across 1000 possible futures, or states of the world (SOWs). To briefly recap, the Research Triangle test case consists of three water utilities in Cary (C), Durham (D) and Raleigh (R). Cary is the main supplier, having a water treatment plant of its own. Durham and Raleigh purchase water from Cary via treated transfers.

Having obtained the .set file containing the Pareto-optimal decision variables and their respective performance objective values, we will now walk through the .set file processing and visualize the decision variables and performance objective space.

Understanding the .set file

First, it is important that the .set file makes sense. The NC_output_MS_S3_N1000.set file should have between 30-80 rows, and a total of 35 columns. The first 20 columns contain values of the decision variables. Note that only Durham and Raleigh have water transfer triggers as they purchase treated water from Cary.

  1. Restriction trigger, RT (C, D, R)
  2. Transfer trigger, TT (D, R)
  3. Jordan Lake allocation, JLA (C, D, R)
  4. Reserve fund contribution as a percentage of total annual revenue, RC (C, D, R)
  5. Insurance trigger, IT (C, D, R)
  6. Insurance payments as a percentage of total annual revenue, IP (C, D, R)
  7. Infrastructure trigger, INF (C, D, R)

The last 15 columns contain the objective values for the following performance objectives of all three utilities:

  1. Reliability (REL) to be maximized
  2. Restriction frequency (RF) to be minimized
  3. Infrastructure net present cost (INF_NPC) to be minimized
  4. Peak financial cost of drought mitigation actions (PFC) to be minimized
  5. Worst-case financial cost of drought mitigation actions (WCC) to be minimized

This reference set needs to be processed to output a .csv file to enable reevaluation for robustness analysis. To do so, run the post_processing.py file found in this GitHub repository in the command line:

python post_processing.py

In addition to post-processing the optimization output files, this file also conduct regional minimax operation, where each regional performance objective is taken to be the objective value of the worst-performing utility (Gold et al, 2019).

This should output two files:

  1. NC_refset.csv No header row. This is the file that will be used to run the re-evaluation for robustness analysis in the next blog post.
  2. NC_dvs_objs.csv Contains a header row. This file that contains the labeled decision variables and objectives, including the minimax regional performance objectives. Will be used for visualizing the reference set’s decision variables and performance objectives.

Visualizing the reference set

Due to the higher number of decision variables, we utilize parallel axis plots to identify how varying the decision variables can potentially affect certain performance objectives. Here, we use the regional reliability performance objective, REL, as an example. Figure 1 below demonstrates how all decision variables vary with regional reliability.

Figure 1: All decision variables for the three utilities. A darker blue indicates a higher degree of reliability.

From Figure 1, most solutions found via the optimization conducted in the previous blog post seem to have relatively high reliability across the full range of decision variable values. It is unclear as to how each decision variable might affect regional reliability. It is thus more helpful to identify specific sets of decision variables (or policies) that enable the achievement of reliability beyond a certain threshold.

With this in mind, we assume that all members of the Triangle require that their collective reliability be at least 98%. This results in the following figure:

Figure 2: All decision variables across the three utilities. The dark lines represent the policies that are at least 98% reliable.

Figure 2 has one clear takeaway: Pareto-optimality does not indicate satisfactory performance. In addition, setting this threshold make the effects of each decision variable clearer. It can be seen that regional reliability is significantly affected by both Durham and Raleigh’s infrastructure trigger (INF). Desirable levels of regional reliability can be achieved when Durham sets a high INF value. Conversely, Raleigh can set lower INF values to benefit from satisfactory reliability. Figure 2 also shows having Durham set a high insurance trigger (IT) may benefit the regional in terms of reliability.

However, there are caveats. Higher INF and IT values for Durham implies that the financial burden of investment and insurance payments are being borne by Raleigh and Cary, as Durham is able to withstand more risk without having to trigger an investment or infrastructure payment. This may affect how each member utility perceives their individual risks and benefits by forming a cooperative contract.

The code to plot these figures can be found under ‘refset_parallel.py’ in the repository.

Robustness analysis and what’s next

So how is setting a threshold value of regional reliability significant?

Within the MORDM framework, robustness is defined using a multivariate satisficing metric (Gold et al, 2019). Depending on the requirements of the stakeholders, a set of criteria is defined that will then be used distinguish between success (all criteria are met) and failure (at least one criteria is not met). Using these criteria, the rest of Pareto-optimal policies are simulated across a number of uncertain SOWs. Each policy’s robustness is then represented by the percent of SOWs in which it meets the minimum performance criteria that has been set.

In this post, we processed the reference set and visualized its decision variable space with respect to each variable’s effect on the reliability performance objective. A similar process can be repeated across all utilities for all performance objectives.

Using the processed reference set, we will conduct multi-criterion robustness analysis using two criteria:

  1. Regional reliability should be at least 98%
  2. Regional restriction frequency should be less than or equal to 20%

We will also conduct sensitivity analysis to identify the the decision variables that most impact regional robustness. Finally, we will conduct scenario discovery to identify SOWs that may cause the policies to fail.

References

Gold, D. F., Reed, P. M., Trindade, B. C., & Characklis, G. W. (2019). Identifying actionable compromises: Navigating multi‐city robustness conflicts to discover cooperative safe operating spaces for regional water supply portfolios. Water Resources Research, 55(11), 9024–9050. https://doi.org/10.1029/2019wr025462

Trindade, B. C., Reed, P. M., & Characklis, G. W. (2019). DEEPLY UNCERTAIN PATHWAYS: Integrated multi-city Regional Water Supply Infrastructure Investment and portfolio management. Advances in Water Resources, 134, 103442. https://doi.org/10.1016/j.advwatres.2019.103442

Networks on maps: exploring spatial connections using NetworkX and Basemap

This blogpost is about generating network graphs interlaid on spatial maps. I’ll be using the data provided by this paper (in the supplementary material) which estimates flows of food across US counties. All the code I’m using here can be found here.

The dataset included in erl_14_8_084011_sd_3.csv of the supplementary material lists the tons of food transported per food category, using the standard classification of transported goods (SCTG) food categories included in the study. The last two columns, ori and des, indicate the origin and destination counties of each flow, using FIPS codes.

To draw the network nodes (the counties) in their geographic locations I had to identify lat and lon coordinates for each county using its FIPS code, which can be found here 1.

Now, let’s these connections in Python, using NetworkX and Basemap. The entire script is here, I’ll just be showing the important snippets below. In the paper, they limit the visualization to the largest 5% of food flows, which I can confirm is necessary otherwise the figure would be unreadable. We first load the data using pandas (or other package that reads csv files), identify the 95th percentile and restrict the data to only those 5% largest flows.

data = pd.read_csv('erl_14_8_084011_sd_3.csv')
threshold = np.percentile(data['total'], 95)
data = data.loc[(data['total'] > threshold)]

Using NetworkX, we can directly create a network out of these data. The most important things I need to define are the dataframe column that lists my source nodes, the column that lists my destination nodes and which attribute makes up my network edges (the connections between nodes), in this case the total food flows.

G = nx.from_pandas_edgelist(df=data, source='ori', target='des', edge_attr='total',create_using = nx.DiGraph())

Drawing this network without the spatial information attached (using the standard nx.draw(G)) looks something like below, which does hold some information about the structure of this network, but misses the spatial information we know to be associated with those nodes (counties).

To associate the spatial information with those nodes, we’ll employ Basemap to create a map and use its projection to convert the lat and lon values of each county to x and y positions for our matplotlib figure. When those positions are estimated and stored in the pos dictionary, I then draw the network using the specific positions. I finally also draw country and state lines. You’ll notice that I didn’t draw the entire network but only the edges (nx.draw_networkx_edges) in an effort to replicate the style of the figure from the original paper and to declutter the figure.

plt.figure(figsize = (12,8))
m = Basemap(projection='merc',llcrnrlon=-160,llcrnrlat=15,urcrnrlon=-60,
urcrnrlat=50, lat_ts=0, resolution='l',suppress_ticks=True)
mx, my = m(pos_data['lon'].values, pos_data['lat'].values)
pos = {}
for count, elem in enumerate(pos_data['nodes']):
     pos[elem] = (mx[count], my[count])
nx.draw_networkx_edges(G, pos = pos, edge_color='blue', alpha=0.1, arrows = False)
m.drawcountries(linewidth = 2)
m.drawstates(linewidth = 0.2)
m.drawcoastlines(linewidth=2)
plt.tight_layout()
plt.savefig("map.png", dpi = 300)
plt.show()

The resulting figure is the following, corresponding to Fig. 5B from the original paper.

I was also interested in replicating some of the analysis done in the paper, using NetworkX, to identify the counties most critical to the structure of the food flow network. Using the entire network now (not just the top 5% of flows) we can use NetworkX functions to calculate each node’s degree and between-ness centrality. The degree indicates the number of nodes a node is connected to, between-ness centrality is an indicator of the fraction of shortest paths between two nodes that pass through a specific node. These are network metrics that are unrelated to the physical distance between two counties and can be used (along with several other metrics) to make inferences about the importance and the position of a specific node in a network. We can calculate them in NetworkX as shown below and plot them using simple pyplot commands:

connectivity = list(G.degree())
connectivity_values = [n[1] for n in connectivity]
centrality = nx.betweenness_centrality(G).values()

plt.figure(figsize = (12,8))
plt.plot(centrality, connectivity_values,'ro')
plt.xlabel('Node centrality', fontsize='large')
plt.ylabel('Node connectivity', fontsize='large')
plt.savefig("node_connectivity.png", dpi = 300)
plt.show()

The resulting figure is shown below, matching the equivalent Fig. 6 of the original paper. As the authors point out, there are some counties in this network, those with high connectivity and high centrality, that are most critical to its structure: San Berndardino, CA; Riverside, CA; Los Angeles, CA; Shelby, TN; San Joaquin, CA; Maricopa, AZ; San Diego, CA; Harris, TX; and Fresno, CA.

1 – If you are interested in how this is done, I used the National Counties Gazetteer file from the US Census Bureau and looked up each code to get its lat and lon.

Parallel axis plots for the absolute beginner

A parallel axis plot is a simple way to convey a lot of information in a meaningful and easy-to-understand way. Also known as parallel coordinate plots (PCP), it is a visualization technique used to analyze multivariate numerical data (Weitz, 2020), or in the case of multi-objective optimization, to analyze tradeoffs between multiple conflicting objectives. As someone new to the field of multi-objective optimization, I found them particularly helpful as I tried to wrap my head around the multi-dimensional aspects of this field.

There are multiple tools in Python that you can use to generate PCPs. There are several different posts by Bernardo and Jazmin that utilize the Pandas and Plotting libraries to do so. In this post, I would like to explain a little about how you can generate a decent PCP using only Numpy and Matplotlib.

For context, I used a PCP to contrast the non-dominated solutions from the entire reference set of the optimized GAA problem reference set.

For the beginner, the figure above demonstrates three important visualization techniques in generating PCPs: color, brushing, and axis ordering. Firstly, it is important to consider using colors that stand on opposite sides on the color wheel to contrast the different types of information you are presenting. Next, brushing should be used to divert the viewer’s attention away from any information deemed unnecessary, highlight vital information, or to prove a point using juxtaposition. Finally, the ordering of the axes is important, particularly when presenting conflicting objectives. It is best for all axes to be oriented in one “direction of preference”, so that the lines between each adjacent axis can represent the magnitude of the tradeoff between two objectives. Thus, the order in which these axes are placed will significantly affect the way the viewer perceives the tradeoffs, and should be well-considered.

To help with understanding the how to generate a PCP, here is a step-by-step walk-through of the process.

1. Import all necessary libraries, load data and initialize the Matplotlib figure

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker

# load data
all_soln = np.loadtxt(open("GAA-reference-set.csv"), delimiter=",")
nd_indices = np.loadtxt(open("non-dominated-index.csv"), delimiter=",")

# identify and list the objectives of the reference set
objs = ['NOISE', 'WEMP', 'DOC', 'ROUGH', 'WFUEL', 'PURCH', 'RANGE', 'LDMAX', 'VCMAX', 'PFPF']

# create an array of integers ranging from 0 to the number of objectives                    
x = [i for i, _ in enumerate(objs)]

# sharey=False indicates that all the subplot y-axes will be set to different values
fig, ax  = plt.subplots(1,len(x)-1, sharey=False, figsize=(15,5))

Two sets of data are loaded:

  • all_soln: the entire reference set
  • nd_indices: an array of row indices of the non-dominated solutions from all_soln

In Line 16, we are initializing a figure fig and an array of axis objects ax. I find that having an array of axes helps me better control tick locations and labeling, since I can iterate over them in a loop.

Bear in mind that this is simply an example. It is also possible to obtain the non-dominated set directly from the the reference set by performing a Pareto sort.

2. Normalize the objective values in all_soln

min_max_range = {}

for i in range(len(objs)):
    all_soln[:,i] = np.true_divide(all_soln[:,i] - min(all_soln[:,i]), np.ptp(all_soln[:,i]))
    min_max_range[objs[i]] = [min(all_soln[:,i]), max(all_soln[:,i]), np.ptp(all_soln[:,i])]

All values in all_soln are normalized by subtracting the minimum value from each objective, then dividing it by the range of values for that objective. The min_max_range dictionary contains the minimum, maximum and range of values for each objective. This will come in handy later on.

3. Iterate through all the axes in the figure and plot each point

I used the enumerate function here. It may seem somewhat confusing at first, but it basically keeps count of your iterations as your are iterating through an object (ie: a list, an array). More information on how it works can be found here.

for i, ax_i in enumerate(ax):
    for d in range(len(all_soln)):
        if ((d in nd_indices)== False):
            if (d == 0):
                ax_i.plot(objs, all_soln[d, :], color='lightgrey', alpha=0.3, label='Dominated', linewidth=3)
            else:
                ax_i.plot(objs, all_soln[d, :], color='lightgrey', alpha=0.3, linewidth=3)
    ax_i.set_xlim([x[i], x[i+1]])

for i, ax_i in enumerate(ax):
    for d in range(len(all_soln)):
        if (d in nd_indices):
            if (d == nd_indices[0]):
                ax_i.plot(objs, all_soln[d, :], color='c', alpha=0.7, label='Nondominated', linewidth=3)
            else:
                ax_i.plot(objs, all_soln[d, :], color='c', alpha=0.7, linewidth=3)
    ax_i.set_xlim([x[i], x[i+1]])

All solutions from the non-dominated set are colored cyan, while the rest of the data is greyed-out. This is an example of brushing. Note that only the first line plotted for both sets are labeled, and that the grey-out data is plotted first. This is so the non-dominated lines are shown clearly over the brushed lines.

4. Write a function to position y-axis tick locations and labels

The set_ticks_for_axis() function is key to this process as it grants you full control over the labeling and tick positioning of your y-axes. It has three inputs:

  • dim: the index of a value from the objs array
  • ax_i: the current axis
  • ticks: the desired number of ticks
def set_ticks_for_axis(dim, ax_i, ticks):
    min_val, max_val, v_range = min_max_range[objs[dim]]
    step = v_range/float(ticks)
    tick_labels = [round(min_val + step*i, 2) for i in range(ticks)]
    norm_min = min(all_soln[:,dim])
    norm_range = np.ptp(all_soln[:,dim])
    norm_step =(norm_range/float(ticks-1))
    ticks = [round(norm_min + norm_step*i, 2) for i in range(ticks)]
    ax_i.yaxis.set_ticks(ticks)
    ax_i.set_yticklabels(tick_labels)

Hello min_max_range! This dictionary essentially makes accessing the extreme values and range of each objective easier and less mistake-prone. It is optional, but I do recommend it.

Overall, this function does two things:

  1. Creates ticks-evenly spaced tick-marks along ax_i.
  2. Labels ax_i with tick labels of size ticks. The tick labels are evenly-spaced values generated by adding step*i to min_val for each iteration i.

A nice thing about this function is that is also preserves the order that the objective values should be placed along the axis, which makes showing a direction of preference easier. It will be used to label each y-axis in our next step.

5. Iterate over and label axes

for dim, ax_i in enumerate(ax):
    ax_i.xaxis.set_major_locator(ticker.FixedLocator([dim]))
    set_ticks_for_axis(dim, ax_i, ticks=10)

FixedLocator() is a subclass of Matplotlib’s ticker class. As it’s name suggests, it fixes the tick locations and prevents changes to the tick label or location that may possibly occur during the iteration. More information about the subclass can be found here.

Here, you only need to label the x-axis with one label and one tick per iteration (hence Line 2). On the other hand, you are labeling the entire y-axis of ax_i, which is where you need to use set_ticks_for_axis().

6. Create a twin axis on the last axis in ax

ax2 = plt.twinx(ax[-1])
dim = len(ax)
ax2.xaxis.set_major_locator(ticker.FixedLocator([x[-2], x [-1]]))
set_ticks_for_axis(dim, ax2, ticks=10)
ax2.set_xticklabels([objs[-2], objs[-1]])

Creating a twin axis using plt.twinx() enables you to label the last axis with y-ticks. Line 3 and 5 ensure that the x-axis is correctly labeled with the last objective name.

7. Finally, plot the figure

plt.subplots_adjust(wspace=0, hspace=0.2, left=0.1, right=0.85, bottom=0.1, top=0.9)
ax[8].legend(bbox_to_anchor=(1.35, 1), loc='upper left', prop={'size': 14})
ax[0].set_ylabel("$\leftarrow$ Direction of preference", fontsize=12)
plt.title("PCP Example", fontsize=12)
plt.savefig("PCP_example.png")
plt.show()

Be sure to remember to label the direction of preference, and one you’ve saved your plot, you’re done!

The source code to generate the following plot can be found here. I hope this makes parallel axis plots a little more understandable and less intimidating.

References

Weitz, D. (2020, July 27). Parallel Coordinates Plots. Retrieved November 09, 2020, from https://towardsdatascience.com/parallel-coordinates-plots-6fcfa066dcb3

Keen, B.A., Parallel Coordinates in Matplotlib. (2017, May 17). Retrieved November 09, 2020, from https://benalexkeen.com/parallel-coordinates-in-matplotlib/

More on subplots with Matplotlib

I got a little ahead of myself with the title of my last post, “Everything you want to know about subplots in Python’s Matplotlib”. While the information in that post can allow you to do quite a lot, there is in fact more that you might want to know. In this post I’ll go over two more subplot tools that are helpful for designing informative and attractive subplots. First, I’ll discuss working with colorbars, a seemly minor task that can be quite time consuming. Then I’ll discuss a brand new feature of Matplotlib, the subplot_mosaic interface. This feature is still in its testing phase, but will likely be the new standard for making subplots in the future.

Formatting colorbars with constrained_layout

Colorbars play an important role in data visualization. Often, you may use a common color to link multiple views of the same data set, or contrast two data sets. Making a colorbar in matplotlib is fairly easy, but unless you use the right tools, making the colorbar fit into the overall graphic can be unexpectedly difficult. Here’s an example of creating a single colorbar for four different subplots. The fig.colorbar() function, allows you to easily add a colorbar to the set of subplots. Unfortunately, with the default settings, this code will shrink two subplots disproportionately.

from matplotlib import pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import numpy as np

# create some sample data
x = np.linspace(-10.0, 10.0, 1000)
y = np.linspace(-10.0, 10.0, 1000)
X, Y = np.meshgrid(x, y)
Z1 = X + Y
Z2 = X - Y
Z3 = X*Y
Z4 = X**2+Y**2

# create a figure object
fig, axes = plt.subplots(2,2, figsize=(8,8))
im1 = axes[0,0].imshow(Z1, cmap='BrBG')
im2 = axes[0,1].imshow(Z2, cmap='BrBG')
im3 = axes[1,0].imshow(Z3, cmap='BrBG')
im4 = axes[1,1].imshow(Z4, cmap='BrBG')

cbar = fig.colorbar(im1, ax=axes[:, 1], shrink=0.8)

One way you may attempt to fix this could be to use the tight_layout() function, which can help align subplots. This will make things worse however, because the colorbar confuses the algorithm that tight_layout uses to arrange the axes objects. The result will look like this (and produce an warning output):

Luckily, there is an alternative to tight_layout, called constrained_layout, which uses a constrained solver to optimize subplot placement. Constrained layout should be called during the creation of the figure object, as demonstrated below. Applying constrained layout to this set of subplots fixes the error and creates a nice looking set of plots.

# create a figure object
fig, axes = plt.subplots(2,2, figsize=(8,8), constrained_layout=True)
im1 = axes[0,0].imshow(Z1, cmap='BrBG')
im2 = axes[0,1].imshow(Z2, cmap='BrBG')
im3 = axes[1,0].imshow(Z3, cmap='BrBG')
im4 = axes[1,1].imshow(Z4, cmap='BrBG')

cbar = fig.colorbar(im1, ax=axes[:, 1], shrink=0.8)

I should note that we can edit the colorbar axes object just like any of the other axes objects, adding labels, adjusting the tick marks etc. For more on colorbars, see the documentation here.

The subplot_mosaic interface

In my last post I discussed the Gridspec interface, which allows you to manually configure custom grid of subplots. While Gridspec can create complex configurations of subplots, manually adjusting the grid can become complicated for complex layouts. Matplotlib recently introduced a new feature, subplot_mosaic, which allows a more intuitive interface for configuring subplots. The subplot_mosaic has a simple and streamlined interface that allows you to easily lay out subplots, then stores these subplots as a dictionary. It’s important to note that this feature is still in testing and (at the time of this posting) is not currently supported by many distributions such as Anaconda. For more examples and full documentation, see the official release page.

Everything you want to know about subplots in Python’s Matplotlib

Sometimes its helpful to get back to basics. I recently created a summer course on data visualization with Python, and the experience made me realize that the workings of Python’s main visualization library, Matplotlib, are often left of out formal Python courses. Over the course of my graduate career I’ve taken several courses on programming and Python, and each time visualization or plotting was viewed as an afterthought. I don’t think my experience is uncommon, creating basic visualizations in Python is fairly straight forward, and a quick Google search will yield tutorials on how to make most common plot types. While constructing the summer course, I realized how much time I could have saved if I had learned how Matplotlib worked earlier in my PhD. With this in mind, I’m writing this post to serve as a guide for those new to plotting with Matplotlib, and to help fill in some gaps for those who are already experienced Matplotlib users.

I’ve chosen to devote this post to making subplots, a task often necessary for scientific visualizations that can be surprisingly frustrating if you don’t understand Matplotlib’s structure. Since Python is an open source language, there are multiple ways of creating and working with subplots, so in this post I’ll outline a few ways that work for me, and provide some context about how things work behind the scenes in Matplotlib.

A brief introduction to Matplotlib’s object oriented syntax

Let’s start with some history. As it’s name suggests, Matplotlib was originally created as a means of replicating Matlab style plotting functionality in Python. As such, one way we can create visualizations using Matplotlib is through “Matlab style” syntax which is contained within Matplotlib’s flagship module, Pyplot. For example, to create a simple line plot we can use the following code:

import matplotlib.pyplot as plt
import numpy as np
x = np.arange(0,10)
y = np.arange(0, 10)

plt.plot(x,y)

Which will generate this plot:

While the Matlab style syntax is easy to use, it is actually quite limiting in its ability to create custom visualizations. Luckily, the Matlab style syntax is simply hiding an object oriented code structure that is at the core of Matplotlib. By directly accessing and working with this object oriented structure we can create highly customized visualizations.

For the purposes of this blog, there are two Matplotlib objects that are important: Figure objects and Axes objects. Figure objects represent the containers that hold a visualization. I think of this as the blank canvas that the plots will be generated on. Figure objects can contain one or multiple Axes objects, each representing an individual chart (this can be a big source of confusion when learning Matplotlib, the term “axes object” is inherited from Matlab and refers to an entire chart rather than a x or y axis) . To create the plot above using Matplotlib’s object oriented syntax we need to make a few small modifications to the original code. First, we’ll generate a Figure object using Pyplot which will automatically create an axes object behind the scenes. Next we access this Axes object using the Figure object’s “gca” method, which stands for “get current axes”. Finally, we’ll use the Axes object to generate the plot:

fig = plt.figure()
ax = plt.gca()
ax.plot(x,y)

This will make the identical plot as above:

Note that we can use the Axes object to create any kind of plot that can be created with the Matlab style syntax. Examples include ax.plot (line plot), ax.scatter (scatter plot), ax.bar (bar plot), ax.imshow (heatmaps and color mesh) etc.

Creating Basic subplots

Now that we have some insight into the object oriented structure of Matplotlib, we can start making some subplots. There are two main ways we can make subplots with Matplotlib, the first is to use Pyplot’s “subplots” function, which allows us to specify the number of rows and columns of plots in your figure. This function will return both a Figure object and a list of Axes objects. Note that the list of Axes objects is arranged in the same way as the subplots are within the figure (i.e. list entry [0,0] is the top left subplot):

fig, [[ax0, ax1],[ax2, ax3]] = plt.subplots(nrows=2, ncols=2)
ax0.text(0.5, 0.5, "This is axes object 0", ha='center')
ax1.text(0.5, 0.5, "This is axes object 1", ha='center')
ax2.text(0.5, 0.5, "This is axes object 2", ha='center')
ax3.text(0.5, 0.5, "This is axes object 3", ha='center')

Which will make the following set of subplots:

Instead of specifying individual names of each axes, we can alternatively store them in a single named list and axes them via indices like this:

fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0,0].text(0.5, 0.5, "This is axes object 0", ha='center')
axes[0,1].text(0.5, 0.5, "This is axes object 1", ha='center')
axes[1,0].text(0.5, 0.5, "This is axes object 2", ha='center')
axes[1,1].text(0.5, 0.5, "This is axes object 3", ha='center')

An alternative way to create subplots is to first make the Figure object on its own, and then add subplots one at a time using the method “add_subplot” from the figure object. This method takes one argument, a three digit number. The first digit represents the number of rows, the second represents the number of columns and the third represents the placement of the individual subplot (the location from left to right, top to bottom, with the first placement being the top left and the last being the lower right. Oddly, this is 1 indexed unlike everything else in the entire Python language):

fig = plt.figure()
ax0 =fig.add_subplot(221)
ax0.text(0.5, 0.5, "This is axes object 0", ha='center')
ax1 =fig.add_subplot(222)
ax1.text(0.5, 0.5, "This is axes object 1", ha='center')
ax2 =fig.add_subplot(223)
ax2.text(0.5, 0.5, "This is axes object 2", ha='center')
ax3 =fig.add_subplot(224)
ax3.text(0.5, 0.5, "This is axes object 3", ha='center')

This script will make the identical set of subplots as shown above:

Giving your plots some room to breath

You may have noticed that the plots above are very cramped. By default, there is very little room between adjacent subplots. One remedy is to use Matplotlib’s tight_layout function, which will automatically fit your subplots into the figure. Just add this one line after creating your subplots. For demonstration, I’ll also add titles and axes labels to each subplot:

fig, [[ax0, ax1],[ax2, ax3]] = plt.subplots(nrows=2, ncols=2)
ax0.text(0.5, 0.5, "This is axes object 0", ha='center')
ax0.set_title('Title 0')
ax0.set_xlabel('X-axis')
ax0.set_ylabel('Y-axis')
ax1.text(0.5, 0.5, "This is axes object 1", ha='center')
ax1.set_title('Title 1')
ax1.set_xlabel('X-axis')
ax1.set_ylabel('Y-axis')
ax2.text(0.5, 0.5, "This is axes object 2", ha='center')
ax2.set_title('Title 2')
ax2.set_xlabel('X-axis')
ax2.set_ylabel('Y-axis')
ax3.text(0.5, 0.5, "This is axes object 3", ha='center')
ax3.set_title('Title 3')
ax3.set_xlabel('X-axis')
ax3.set_ylabel('Y-axis')
plt.tight_layout()
plt.savefig('tight_layout.png')

This will create the following set of plots:

If we want to add some extra padding between subplots, we can add some arguments to override the default tight_layout parameters. The argument “pad” adds padding between subplots and the figure boarders, while “h_pad” and “w_pad” add height and width padding between subplots. The units of this padding are in percentages of the default font size.

plt.tight_layout(pad=1.5, h_pad=2.5, w_pad=2)

Note that to add the necessary padding, tight_layout will make the subplots themselves smaller and smaller. To fix this, we can increase the size of the figure object using the “figsize” argument when we create the Figure object. The units of this function are in inches:

fig, [[ax0, ax1],[ax2, ax3]] = plt.subplots(nrows=2, ncols=2, figsize=(8,6))

Before moving on, I should note that the function: plt.subplots_adjust() has very similar functionality to tight_layout and can allow you to adjust left, right, bottom and top paddings individually. For the sake of brevity I’m omitting it here.

Getting fancy: creating subplots of different sizes

So far we’ve been creating subplots of uniform size. In practice however, it can be helpful to generate plots of varying sizes. We can do this using the Gridspec class. Gridspec will create a grid of subplot locations within a Figure object. When generating subplots, we can assign each a location and size in Gridspec coordinates. Importantly, a single subplot can span multiple rows or columns of Gridspec coordinates. Below, I’ll make a 2×3 Gridspec and use it to create different sized subplots.

fig = plt.figure(figsize=(8,6))
gspec = fig.add_gridspec(nrows=2, ncols=3)

# the first subplot will span one row and two columns
# it will start at the top left
ax0 = fig.add_subplot(gspec[0,:2])
ax0.text(0.5, 0.5, "This is axes object 0, \ngspec coordinates [0,:2]", ha='center')

# the second subplot will span one row and one column
# it will start at the bottom left
ax1 = fig.add_subplot(gspec[1,0])
ax1.text(0.5, 0.5, "This is axes object 1, \ngspec coordinates [1,0]", ha='center')

# the third subplot will span one row and one column
# it will start at the bottom middle
ax2 = fig.add_subplot(gspec[1,1])
ax2.text(0.5, 0.5, "This is axes object 2, \ngspec coordinates [1,1]", ha='center')

# the fourth subplot will span two rows and one column
# it will start at the top right
ax3 = fig.add_subplot(gspec[:,2])
ax3.text(0.5, 0.5, "This is axes object 3, \ngspec coordinates [:,2]", ha='center')

plt.tight_layout()

Which will make this plot:

We can also customize the height and width of each Gridspec coordinate using the arguments “height_ratios” and “width_ratios” respectively when creating the Gridspec object.

fig = plt.figure(figsize=(8,6))
# parameters to specify the width and height ratios between rows and columns
widths= [1, 1.5, 2]
heights = [1, .5]

gspec = fig.add_gridspec(ncols=3, nrows=2, width_ratios = widths, height_ratios = heights)

# the first subplot will span one row and two columns
# it will start at the top left
ax0 = fig.add_subplot(gspec[0,:2])
ax0.text(0.5, 0.5, "This is axes object 0, \ngspec coordinates [0,:2]", ha='center')

# the second subplot will span one row and one column
# it will start at the bottom left
ax1 = fig.add_subplot(gspec[1,0])
ax1.text(0.5, 0.5, "This is axes object 1, \ngspec coordinates [1,0]", ha='center')

# the third subplot will span one row and one column
# it will start at the bottom middle
ax2 = fig.add_subplot(gspec[1,1])
ax2.text(0.5, 0.5, "This is axes object 2, \ngspec coordinates [1,1]", ha='center')

# the fourth subplot will span two rows and one column
# it will start at the top right
ax3 = fig.add_subplot(gspec[:,2])
ax3.text(0.5, 0.5, "This is axes object 3, \ngspec coordinates [:,2]", ha='center')

plt.tight_layout()

Final thoughts

There are many more ways we can customize subplots in Matplotlib, but the material in this post suits my needs 99% of the time. For further reading, check out the Matplotlib documentation and examples: https://matplotlib.org/gallery/subplots_axes_and_figures/subplots_demo.html#sphx-glr-gallery-subplots-axes-and-figures-subplots-demo-py

Spatial and temporal visualization of water demands in a basin

One of my main projects in the last couple years has been in the Upper Colorado River Basin, where we’ve been investigating how hundreds of water users in the basin might be affected by a variety of different changes and uncertainties that might take place in the region. Being in Colorado, water allocation in the basin follows prior-appropriation, where every user has a certain water right, defined by its seniority (where more senior = better) and its decree (i.e. how much water the right is granted for extraction). For the different users in the basin to receive water for their respective uses, prior-appropriation determines who gets X amount of water first based on seniority and given water availability, and then repeats down the seniority order until all requested water has been allocated. Hence, no user can extract water in a manner that affects any senior to them user.

During this investigation, we’ve been interested in seeing how this actually plays out through time and space in the basin, with the aim of potentially better understanding any consequential relationships that might exist between different users, as well as any emerging patterns that might be missed by looking at the data in a different manner. I tried to write a little script to do this in Python. I will be visualizing how users along the basin requested for some water at some historical month (the demand) and how much of that demand was actually met (the shortage), based on their right seniority and water availability in the basin.

There have been multiple posts in the blog on generating maps in Python (as well as in other languages), and they all use a module called Basemap which has been the most popular for these things, but it’s kinda buggy, and kinda a pain to install, and kinda a pain to get working, and I spent the better part of an entire workday to re-set it up on my machine and couldn’t. Enter Cartopy. After Basemap was announced deprecated, Cartopy was meant to be its replacement so I decided to transition. It was super easy to install and start generating maps within a couple minutes and the code I’ll be sharing today will be using that. I will also be using matplotlib’s animation classes to capture the water allocation to the different users through time in a video or a GIF.

First, I load up all necessary packages and data. structures contains the X and Y coordinates of all the diversion points; demands and shortages contain monthly data of water demand and shortage for each diversion point.

import numpy as np
import cartopy.feature as cpf
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.io.img_tiles as cimgt
import pandas as pd
import matplotlib.animation as animation
import math

structures = pd.read_csv('modeled_diversions.csv',index_col=0)
demands = pd.read_csv('demands.csv',index_col=0)
shortages = pd.read_csv('shortages.csv',index_col=0)

Then, I set up the extent of my map (i.e., the region I would like to show). rivers_10m loads the river “feature” at a 10m resolution. There’s a lot of different features that can be added (coastlines, borders, etc.). Finally, I load the tiles which is basically the background map image (many other options also).

extent = [-109.069,-105.6,38.85,40.50]
rivers_10m = cpf.NaturalEarthFeature('physical', 'rivers_lake_centerlines', '10m')
tiles = cimgt.StamenTerrain()

I draw the figure more or less as I would in matplotlib, using the matplotlib scatter to draw my demand and shortage points. The rest of the lines are basically legend customization by creating dummy artists to show max demands and shortages in the legend.

fig = plt.figure(figsize=(12, 6))
ax = plt.axes(projection=tiles.crs)
ax.add_feature(rivers_10m, facecolor='None', edgecolor='b')
ax.add_image(tiles, 9, interpolation='none')
ax.set_extent(extent)
dem_points = ax.scatter(structures['X'], structures['Y'], marker = '.', s = demands['0']/50, c = 'dodgerblue', transform=ccrs.Geodetic())
short_points = ax.scatter(structures['X'], structures['Y'], marker = '.', s = shortages['0']/50, c = 'coral' ,transform=ccrs.Geodetic())
l2 = ax.scatter(-110,37, s=demands.values.max()/50, c = 'dodgerblue', transform=ccrs.Geodetic())
l4 = ax.scatter(-110,37, s=shortages.values.max()/50, c = 'coral',transform=ccrs.Geodetic())
dem_label = ax.scatter(-110,37, s=0, transform=ccrs.Geodetic())
short_label = ax.scatter(-110,37, s=0, transform=ccrs.Geodetic())
labels = ['Max Demand' , str(demands.values.max()) + ' af', 
          'Max Shortage' , str(shortages.values.max()) + ' af']
legend = ax.legend([dem_label, l2, short_label, l4], labels, ncol=2, loc = 'upper left', title = 'Month: '+ str((0 + 10) % 12 +1) + '/' + str(int(math.floor(0/12))+1908)+'\n', fontsize=10, title_fontsize = 14, borderpad=2, handletextpad = 1.3)

This code should produce something like the following, which shows the relative demand across users in blue, as well as how much of that demand was not met (shortage) in orange for November 1908. The large circles in the legend show the max demand and shortage across all users across all months in the record for reference.

To animate this, it’s very simple. All we need to create is another function (in this case update_points) that will define what changes at every frame of the animation. I’ve defined my function to adjust the size of every circle according to the timestep/frame, as well as change the title of the legend to the correct month. Matplotlib’s FuncAnimation then uses that and my figure to update it repeatedly (in this case for 120 steps). Finally, the animation can be saved to a video.

def update_points(num, dem_points, short_points, legend):
    dem_points.set_sizes(demands[str(num)]/10)
    short_points.set_sizes(shortages[str(num)]/10)
    legend.set_title('Month: '+ str((num + 10) % 12 +1) + '/' + str(int(math.floor(num/12))+1908))
    return dem_points, short_points, legend 
       
anim = animation.FuncAnimation(fig, update_points, 120, fargs=(dem_points, short_points, legend),
                                   interval=200, blit=False)
anim.save('basin_animation.mp4', fps=10,  dpi=150, extra_args=['-vcodec', 'libx264'])
WordPress reduces resolution, full res can be found here: https://imgur.com/a/6zfYIDU

There’s a lot to be added and improved, but from this simple version we can immediately see certain diversions popping out as well as geographical regions that exhibit frequent shortage. I will continue working on this and hopefully share improved versions in the future.

Intro to Machine Learning Part 6: Gaussian Naive Bayes and Logistic Regression

Machine Learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulates the classification problem as a search for the maximum margin hyperplane that divides two classes. Today we’ll take different view on binary classification, we’ll use our training set to construct P(y|x), the probability of class y given a set of features x and classify each point by determining which class it is more likely to be. We’ll examine two algorithms for that use different strategies for estimating P(y|x), Naïve Bayes and Logistic regression. I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and  labels Y∈ (+1,-1),  positive points are shown as blue circles and negative as red triangles. This example was inspired by an in class exercise in CS 5780 at Cornell, though I’ve created this data set and set of code myself using python’s scikit-learn package.

raw_points

Figure 1: Example training set

 

Gaussian Naïve Bayes

Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to generate P(x,y) and then uses Bayes Rule to find P(y|x):

P(y|x)=\frac{P(x|y)P(y)}{P(x)}                                (1)

A necessary condition for equation 1 to hold is the Naïve Bayes assumption, which states that feature values are independent given the label. While this is a strong assumption, it turns out that using this assumption can create effective classifiers even if it is violated.

To use Bayes rule to construct a classifier, we need a second assumption regarding the conditional distribution of each feature x on each label y. Here we’ll use a Gaussian distribution such that:

P(x|y) ~ N(\mu_y, \Sigma_y)                                                                                   (2)

Where \Sigma_y is a diagonal covariance matrix with [\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha, y} for each feature \alpha.

For each feature, $\alpha$, and each class, c we can then model P(x_\alpha|y) as:

P(x_\alpha|y=c) ~ N(\mu_{\alpha c},\sigma^2_{\alpha c})=\frac{1}{\sqrt{2\pi}\sigma_\alpha c}e^{-\frac{1}{2}(\frac{x_\alpha-\mu_{\alpha c}}{\sigma_{\alpha c}})^{2}}                              (3)

We can then estimate model parameters:

\mu_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)x_{i \alpha}                                                                   (4)

\sigma^2_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)(x_{i \alpha}-\mu_{\alpha c})^2                                                  (5)

Where:

n_c = \sum^{n}_{i=1}I(y_i=c)                                                                                (6)

Parameters can be estimated with Maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).

Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, a hyperplane that separates the two classes,  which takes the following form:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                             (7)

Where w is a vector of coefficients that define the separating hyperplane and b is the hyperplane’s intercept. W and b are functions of the Gaussian moments derived in equations 4 and 5. For a full derivation of the linear classifier starting with the Naive Bayes assumption, see the excellent course notes from CS 5780.

Logistic Regression

Logistic regression is the discriminative counterpart to Naive Bayes, rather than modeling P(x,y) and using it to estimate P(y|x), Logistic regression models P(y|x) directly:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                              (8)

Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b rather than deriving them from the moments of P(x,y). Rather than seeking to fit parameters that best describe the test data, logistic regression seeks to fit a hyperplane that best separates the test data. For derivation of MLE and MAP estimates of logistic regression parameters, see the class notes from CS 5780.

Comparing Gaussian Naive Bayes and Logistic Regression

Below I’ve plotted the estimated classifications by the two algorithms using the Scikit-learn package in Python. Results are shown in Figure 2.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create a test data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3], \
[5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3], \
[4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos,neg), 0)
labels = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1])

## compare Naive Bayes and Logistic Regression ##

# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions and plot
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

fig1, axes = plt.subplots(1,2, figsize=(10,4))

axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')
#plt.legend(['Positive Points', 'Negative Points'], scatterpoints=1)
#.savefig('NB_classification.png', bbox_inches='tight')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)

axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1'); 
axes[1].set_ylabel('X2'); axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')

 

 

compare_classification

Figure 2: Example classification with Gaussian Naive Bayes (left) and Logistic regression. Blue shaded areas represent a prediction of positive labels for the data points, the red shaded areas represent predictions of negative labels.

Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of class are likely to be members of that class, which leads it to mislabel positive training points with features (3,3), (4,4) and (5,5). Logistic regression on the other hand is only concerned with correctly classifying points, so the signal from the outliers is more influential on its classification.

So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, than Naïve Bayes has performed better. In general, Logistic Regression has been found to outperform Naïve Bayes on large data sets but is prone to over fit small data sets. The two algorithms will converge asymptotically if the Naïve Bayes assumption holds.

Visualizing P(y|x)

One advantage to these methods for classification is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision making contexts such as scenario discover for water resources systems, demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.

# plot Naive Bayes predicted probabilities
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1'); 
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# plot Logistic Regression redicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1')
axes[1].set_ylabel('X2'); axes[1].set_title('Logistic Regression')
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')

compare_probs

Figure 3: Conditional probabilities P(y|x) generated by Naive Bayes (left) and Logistic Regression.

Further reading

This post has focused on Gaussian Naive Bayes as it is the direct counterpart of Logistic Regression for continuous data. It’s important to note however, that Naive Bayes frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these context, see these notes from CS 5780.

As mentioned above, logistic regression has been for scenario discovery in water resources systems, for more detail and context see Julie’s blog post.

References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Course Notes from MIT: https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html

Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supplyWater Resources Research54, 4638–4662. https://doi.org/10.1029/2018WR022743

Interacting with Plotted Functions Using Jupyter Notebooks ipywidgets with matplotlib

Interacting with Plotted Functions Using Jupyter Notebooks ipywidgets with matplotlib

When exploring the properties of an individual function or a system of equations, I often find myself quickly writing up code in a Jupyter Notebook (overview) to plot functions using matplotlib. While a previous blogpost by Jazmin has gone over creating interactive plots with matplotlib, being able to actively change variables to quickly explore their sensitivity or even teach others is crucial.

The primary reason I started playing with interactive widgets in Jupyter Notebook was to  teach individuals the basic mechanics of a simple crop allocation model that utilizes Positive Mathematical Programming (PMP) (Howitt, 1995). Beyond showing the simple calibration steps via code, allowing students to interact with individual variables allows for them to not only explore the sensitivity of models but also to find where such a model might break.

In this specific case, the marginal cost and revenues for a specific commodity (corn) versus the number of acres planted are graphed. A breaking case where the crop’s input data (the price per ton of corn) causes the model to produce unrealistic results (a negative marginal cost) is produced below.

Plotting Functions in matplotlib

One of the features that is often overlooked in matplotlib is that you can (indirectly) plot functions without much effort. The following commented code is used to demonstrate plotting marginal cost and revenue curves on the same graph.

import matplotlib.pyplot as plt
from numpy import *

#fixed inputs from calibrated model
#alpha and gamma for quadratic formulation from Howitt, 1995
input_params = [743.0, 2.57]

#crop revenue per acre, known
crop_revenues = 1500

#Creating the x-axis domain, controls range of x inputs
t = linspace(0, 350)

#t is the independent variable, corn_MC is dependent variable
corn_MC = input_params[0]  + input_params[1] * t 

#note that you have to multiply t by 0 to populate entire line
corn_MR = crop_revenues + 0 * t

plt.figure(dpi=400) #set resolution of graph, can change. 

#label axes
plt.ylabel('Marginal Cost ($/acre)')
plt.xlabel('Acres of Corn Planted')

#plot as many lines as you'd like, expand them here
plt.plot(t, corn_MC, 'r', label='Marginal Cost')
plt.plot(t, corn_MR, 'b', label='Marginal Revenue')

#change size of ticks on axes
plt.tick_params(labelsize=8)

legend = plt.legend(loc=4, shadow=True)
plt.show()

 

The resulting graph is generated inline in a Jupyter Notebook. Note that the resolution and size of the graphic is adjusted using plot.figure(dpi=400)—the default is much smaller and makes for hard-to-read graphs!

matplotlib_output.png

Interactive Inputs for Graphics in Jupyter Notebook

While graphing a function in Jupyter Notebook using matplotlib is rather straightforward, interactive plots with functions are rarely utilized. Using iPython widgets allows us to easily create these tools for any type of situation, from vertical and horizontal slider bars to progress bars to dropdown to boolean ‘click me’ buttons. You can find any of these features here. For the sake of simplicity, I have simply included three slider bars to demonstrate these features.

Notably, as an expansion of the example above, I have to run scipy.minimize multiple times whenever inputs are recalibrated (for questions on this end, I am more than happy to explain PMP if you contact me directly). To do this, I have to imbed multiple functions into a larger function to carry this out. However, I have not noticed any considerable slowdowns when running this much code.

To carry out the entirety of the example above, I have almost 300 lines of code to fully encapsulate a 3-crop allocation problem. For simplicity’s sake, I defined the function ‘pmp_example_sliders’ to contain all of the code required to calibrate and produce the graphic above (note that input parameters are what are calculated) using Scipy.optimize.minimize. Shadow prices are then derived using scipy.optimize.minimize and used to create the parameters necessary for the  marginal cost curve shown in this example. Note that the full model can be found in the GitHub file.

from ipywidgets import *
import scipy.optimize
%matplotlib inline
from numpy import *
import matplotlib.pyplot as plt

def pmp_example_sliders(corn_price=250, corn_initial_acres=200, x_axis_scale=350):
return plot
###Code for Running Calibration and Graphing Resulting Graph###
#create sliders for three variables to be changed
#have to have function with changing parameters (shown above)
#note that scale for corn_price is form 10 to 500 and stepping by 10
slider = interactive(pmp_example_sliders, corn_price=(10, 500, 10),
corn_initial_acres=(0.01,400, 10), x_axis_scale=(30, 650, 10))

#display the resulting slider
display(slider)

The following graphic results with the aforementioned sliders:

interact_1.PNG

We can easily change the input slider on the corn price from 250 to 410 to change to the following results:

interact_2.PNG

Because of this interactive feature, we not only see how the MC and MR curves interact, but we can observe that the Alpha Parameter (shown below the graph) becomes negative. While not important for this specific blog post, it shows where the crop allocation model produces negative marginal costs for low allocations of corn. Thus, utilizing interactive graphics can help highlight dynamics and drawbacks/limitations for models, helping assist in standalone tutorials.

If you would like to run this yourself, please feel free to download the Jupyter Notebooks from my GitHub. All code is written in Python 3. If you need assistance opening Jupyter Notebook, please refer to my blog post introducing Jupyter Notebook (link) or how to create a shortcut to launch Jupyter Notebook from your current working directory (link).

Fitting Hidden Markov Models Part II: Sample Python Script

This is the second part of a two-part blog series on fitting hidden Markov models (HMMs). In Part I, I explained what HMMs are, why we might want to use them to model hydro-climatological data, and the methods traditionally used to fit them. Here I will show how to apply these methods using the Python package hmmlearn using annual streamflows in the Colorado River basin at the Colorado/Utah state line (USGS gage 09163500). First, note that to use hmmlearn on a Windows machine, I had to install it on Cygwin as a Python 2.7 library.

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states can be considered, as well as other distributions, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it might make sense to first log-transform the annual flows at the state line so that the Gaussian models won’t generate negative streamflows, so that’s what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with from hmmlearn.hmm import GaussianHMM. The fit function of this class requires as inputs the number of states (n_components, here 2 for wet and dry), the number of iterations to run of the Baum-Welch algorithm described in Part I (n_iter; I chose 1000), and the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set initial parameter estimates before fitting the model and only state those which need to be initialized with the init_params argument. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I simply use default estimates, but this tutorial shows how to pass your own. I call the model I fit on line 5 model.

Among other attributes and methods, model will have associated with it the means (means_) and covariances (covars_) of the Gaussian distributions fit to each state, the state probability transition matrix (transmat_), the log-likelihood function of the model (score) and methods for simulating from the HMM (sample) and predicting the states of observed values with the Viterbi algorithm described in Part I (predict). The score attribute could be used to compare the performance of models fit with different initial parameter estimates.

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and variances and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done on lines 22-26 if the mean of state 0 is greater than the mean of state 1.


from hmmlearn.hmm import GaussianHMM

def fitHMM(Q, nSamples):
    # fit Gaussian HMM to Q
    model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))
    
    # classify each observation as state 0 or 1
    hidden_states = model.predict(np.reshape(Q,[len(Q),1]))

    # find parameters of Gaussian HMM
    mus = np.array(model.means_)
    sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])])))
    P = np.array(model.transmat_)

    # find log-likelihood of Gaussian HMM
    logProb = model.score(np.reshape(Q,[len(Q),1]))

    # generate nSamples from Gaussian HMM
    samples = model.sample(nSamples)

    # re-organize mus, sigmas and P so that first row is lower mean (if not already)
    if mus[0] > mus[1]:
        mus = np.flipud(mus)
        sigmas = np.flipud(sigmas)
        P = np.fliplr(np.flipud(P))
        hidden_states = 1 - hidden_states

    return hidden_states, mus, sigmas, P, logProb, samples

# load annual flow data for the Colorado River near the Colorado/Utah state line
AnnualQ = np.loadtxt('AnnualQ.txt')

# log transform the data and fit the HMM
logQ = np.log(AnnualQ)
hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)

Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.


from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

def plotTimeSeries(Q, hidden_states, ylabel, filename):

    sns.set()
    fig = plt.figure()
    ax = fig.add_subplot(111)

    xs = np.arange(len(Q))+1909
    masks = hidden_states == 0
    ax.scatter(xs[masks], Q[masks], c='r', label='Dry State')
    masks = hidden_states == 1
    ax.scatter(xs[masks], Q[masks], c='b', label='Wet State')
    ax.plot(xs, Q, c='k')
    
    ax.set_xlabel('Year')
    ax.set_ylabel(ylabel)
    fig.subplots_adjust(bottom=0.2)
    handles, labels = plt.gca().get_legend_handles_labels()
    fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True)
    fig.savefig(filename)
    fig.clf()

    return None

plt.switch_backend('agg') # turn off display when running with Cygwin
plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')

Wow, looks like there’s some persistence! What are the transition probabilities?


print(model.transmat_)

Running that we get the following:

[[ 0.6794469   0.3205531 ]
[ 0.34904974  0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distribution of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in their respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = πP, where P is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = (1/ Σi[ei])e in which e is the eigenvector of PT corresponding to an eigenvalue of 1, and ei is the ith element of e. The overall distribution for our observations is then Y ~ π0N(μ0,σ02) + π1*N(μ1,σ12). We plot this distribution and the component distributions on top of a histogram of the log-space annual flows below.


from scipy import stats as ss

def plotDistribution(Q, mus, sigmas, P, filename):

    # calculate stationary distribution
    eigenvals, eigenvecs = np.linalg.eig(np.transpose(P))
    one_eigval = np.argmin(np.abs(eigenvals-1))
    pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval])

    x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000)
    fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0])

    x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000)
    fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1])

    x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000)
    fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \
        pi[1]*ss.norm.pdf(x,mus[1],sigmas[1])

    sns.set()
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.hist(Q, color='k', alpha=0.5, density=True)
    l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn')
    l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn')
    l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn')

    fig.subplots_adjust(bottom=0.15)
    handles, labels = plt.gca().get_legend_handles_labels()
    fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True)
    fig.savefig(filename)
    fig.clf()

    return None

plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')

Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.