Clustering basics and a demonstration in clustering infrastructure pathways

Introduction

When approaching a new dataset, the complexity may be daunting. If patterns in the data are not readily apparent, one potential first step is to cluster the data and search for groupings. Clustering data, or cluster analysis, can reveal relationships between the observations and provide insight into the data.

Figure: Graphical representation of how clustering can be used to identify patterns in a data set.

Motivation

The outcome of a cluster analysis can be highly dependent upon the clustering algorithm being used, the clustering criteria, and the specified number of clusters to be found. Anticipating these influences and how to manipulate them will set you up for success in your data analysis.

In this post, I introduce some fundamental clustering algorithms, the hierarchical and K-Means clustering algorithms. Then I will share some important considerations, such as how to select a suitable number of clusters. Finally, I will demonstrate the power of clustering by applying the methods to an infrastructure pathways dataset, to cluster infrastructure pathways generated for 1,000 different states of the world into different sets of ‘light’, ‘moderate’, and ‘heavy’ infrastructure development.

Let’s begin!

Methods

Hierarchical Clustering

Hierarchical clustering can be performed “bottom-up”, through an agglomerative approach, which starts by placing each observation in its own cluster and then successively links the observations into increasingly larger clusters. Alternatively, the clustering can take a “top-down”, or divisive, approach which begins with all observations within a single cluster and partitions the set into smaller clusters until each observation exists within its own cluster.

Figure: A graphical representation of the hierarchical clustering process. Red circles denote clusters. If read from left to right, then the figure is representative of the agglomerative approach. If read from right to left, then the figure is representative of the divisive approach.

Specifying how clusters are preferentially assigned will influence the final shape and contents of the clusters. When performing hierarchical clustering, the user must specify the linkage criterion, which determines how the distance between clusters is measured, and consequently which clusters will be grouped during the next iteration. Three common linkage criteria are:

  • Complete (max): the maximum distance between an observation in one group and an observation in the other group: max{ d(a, b) : a in A, b in B }
  • Single (min): the minimum distance between an observation in one group and an observation in the other group: min{ d(a, b) : a in A, b in B }
  • Average: the average distance between observations in one group and observations in the other group: average{ d(a, b) : a in A, b in B }
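
To make the three criteria concrete, here is a small sketch computing each linkage distance by hand for two tiny example clusters (the points below are arbitrary placeholders):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny example clusters (arbitrary 2-D points)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise distances d(a, b) for a in A, b in B
D = cdist(A, B)

complete = D.max()   # complete (max) linkage
single = D.min()     # single (min) linkage
average = D.mean()   # average linkage

print(complete, single, average)  # 6.0 3.0 4.5
```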

Additionally, when calculating the linkage criteria, different methods of measuring the distance can be used. For example, when working with correlated data, it may be preferable to use the Mahalanobis distance over the regular Euclidean norm.
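
A small sketch of why this matters: for strongly correlated data, two points at identical Euclidean distances from a cluster center can sit at very different Mahalanobis distances (the covariance matrix below is an illustrative assumption, not taken from any real dataset):

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

# Illustrative covariance matrix for strongly correlated 2-D data
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
VI = np.linalg.inv(cov)  # mahalanobis() expects the inverse covariance

center = np.zeros(2)
along = np.array([1.0, 1.0])     # displaced along the correlation direction
against = np.array([1.0, -1.0])  # displaced against the correlation direction

# The two points are equally far from the center in Euclidean terms...
print(euclidean(center, along), euclidean(center, against))

# ...but the Mahalanobis distance flags the displacement that fights the
# correlation structure as much farther away
print(mahalanobis(center, along, VI), mahalanobis(center, against, VI))
```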

The results of the clustering analysis can be viewed through a dendrogram, which displays a tree-like representation of the cluster structure. For more detail on dendrograms, and a step-by-step tutorial for creating color-coded dendrograms in R, see Rohini’s (2018) post here.
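
As a quick sketch of what this looks like in practice, scipy can build and draw a dendrogram from a small toy dataset (the data here are random placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))  # random placeholder data

# Agglomerative clustering with average linkage on Euclidean distances
Z = linkage(X, method='average')

# The dendrogram shows the order and distance at which clusters merge
dendrogram(Z)
plt.xlabel('Observation')
plt.ylabel('Merge distance')
plt.show()
```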

K-Means Clustering

The K-means algorithm follows three simple steps.

  1. Selection of k centroids within the sample space.

In the traditional naïve K-Means algorithm, centroids need not correspond to specific observations within the space. They can be randomly selected, or chosen through more informed methods (see, for example, K-Means++).

  2. Assignment of each observation to the cluster with the nearest centroid.

  3. Re-definition of each cluster centroid as the mean of the data points assigned to it.

Steps 2 and 3 of the algorithm are repeated until the centroids begin to stabilize, such that subsequent iterations produce centroids located within some small, tolerable distance from one another.

Figure: A graphical representation of the k-means clustering process, with 3 clusters. X’s denote the centroids for that respective cluster. The dashed arrows denote the shift in the location of the centroid during the next iteration. Points are colored according to their assigned cluster during that iteration.

The centroids toward which the algorithm converges may be a locally optimal clustering solution, and can be highly dependent upon the initial centroid selection. Often, this sensitivity is handled by performing the clustering processes several times, and choosing the set of clusters which yield the smallest within-cluster sum of squares.
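
This restart strategy is built into scikit-learn's KMeans via the n_init parameter; here is a minimal sketch on toy data (the blob locations and random seeds are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Toy data: three hypothetical, well-separated blobs of 50 points each
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [3, 0], [0, 3])])

# n_init=10 restarts the algorithm from 10 random centroid initializations
# and keeps the run with the smallest within-cluster sum of squares (inertia)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.inertia_)              # within-cluster sum of squares of the best run
print(np.bincount(km.labels_))  # number of points assigned to each cluster
```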

Selecting the number of clusters

In some contexts, it is possible to know how many clusters are required prior to the analysis. For example, when working with Fisher’s famous iris flower data set, three clusters are needed because there are three types of flowers observed within the data.

In an exploratory context, however, the appropriate number of clusters may not be readily apparent. One method of determining the preferred number of clusters is to test multiple different values and evaluate the separation distances between the clusters. If the number of clusters is well-chosen, then the clusters will be well-separated from one another and the distances within each cluster will be small.

The Silhouette Coefficient, or silhouette score, is one method of measuring the goodness of fit for a set of clusters, and takes into consideration both the spacing between clusters (inter-cluster distance) and the distance between points within a cluster (intra-cluster distance). The silhouette score is defined as:

s = (b - a) / max(a, b)

Where:

a = average intra-cluster distance,

b = average inter-cluster distance.

The silhouette score ranges from -1 to +1. A value near +1 suggests that the clusters are far from one another and that the distances within each cluster are small, and thus that the chosen number of clusters is well-suited to the data. A value near -1, however, suggests that the clusters poorly describe the patterns in the data.
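
Putting this to work, scikit-learn's silhouette_score can be compared across candidate cluster counts. In this toy sketch (arbitrary blobs, clustered here with K-Means for brevity), three clusters should score highest:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three hypothetical, well-separated blobs
X = np.vstack([rng.normal(loc, 0.2, size=(40, 2))
               for loc in ([0, 0], [4, 0], [0, 4])])

# Evaluate the silhouette score for each candidate number of clusters
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

A peak in the score, here at k = 3, is the signal to look for when choosing the number of clusters.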

Example: Clustering Infrastructure Pathways

In water supply planning contexts, infrastructure pathways describe the sequencing of infrastructure investment decisions throughout time, in response to changes in the system (construction of new reservoirs, expansions to water treatment capacity, etc.). Adaptive infrastructure pathways use a risk-of-failure (ROF) based set of policy rules to trigger new infrastructure development. Adaptive infrastructure pathways have been shown to effectively coordinate water infrastructure decisions in complex multi-actor systems (Zeff et al., 2016). While the benefits of this adaptive strategy are clear, ROF-based pathways also bring new challenges for decision makers – rather than specifying a single sequence of infrastructure investments, an adaptive rule system will generate a unique sequence of investments for every future state of the world. To understand how a rule system behaves, decision makers must have tools to visualize an ensemble of infrastructure pathways. This is where cluster analysis comes in. Using clustering, we can group similar infrastructure sequences together, allowing decision makers to summarize how a candidate infrastructure investment policy will behave across many future states of the world.

Here, I analyze a set of infrastructure pathways generated for a hypothetical water utility with four infrastructure options:

  • small water treatment plant expansion
  • large water treatment plant expansion
  • construction of a small new run of river intake
  • construction of a large new run of river intake

The utility is examining a candidate policy and has generated pathways for 1,000 different states of the world, or realizations, each characterized by unique hydrologic and population dynamics. Given the 1,000 different realizations, each with a unique infrastructure pathway, I am interested in clustering the pathways according to the timing of the infrastructure construction decisions to understand how the policy behaves.

The dataset used for this demonstration is courtesy of David Gold, who is also responsible for providing significant portions of the following code and assistance along the way. Thanks David!

Follow along with this demonstration by downloading the .txt data file HERE!

Let’s begin by loading in the data.

import pandas as pd

# Load in data
pathways_df = pd.read_csv('./ClusterData.txt', sep = '\t', header = None)

# Extract the underlying array, used as input to the clustering routines below
cluster_input = pathways_df.values

The data contains information describing the timing of infrastructure construction for 1,000 different simulated realizations. Each column represents a different infrastructure option, and each row is one of the 1,000 simulated realizations. Each cell contains a standardized value, ranging from 0 to 1, corresponding to the timing of the infrastructure construction within that realization. A value of 1 indicates that the infrastructure option was not built during the 45-year simulation period in that realization.

Realization | Infra. Opt. #1 | Infra. Opt. #2 | ... | Infra. Opt. #M
1           | W1,1           | W1,2           | ... | W1,M
2           | W2,1           | W2,2           | ... | W2,M
...         | ...            | ...            | ... | ...
N           | WN,1           | WN,2           | ... | WN,M
Table: Example output data from the infrastructure pathways optimization process; within each realization (state of the world (SOW)) infrastructure construction decisions are made during different weeks within the simulation period (45-years), considering a subset of infrastructure options for each utility. Wi,j denotes a standardized value corresponding to the timing within realization i that infrastructure option j is built.
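
If you don't have the data file handy, a synthetic stand-in with the same shape can be generated so the rest of the code still runs (the values below are arbitrary placeholders, not the real pathway data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 1,000 realizations x 4 infrastructure options, standardized build
# timings in [0, 1]; a value of exactly 1 marks an option never built
synthetic = rng.uniform(0, 1, size=(1000, 4))
synthetic[rng.uniform(size=(1000, 4)) < 0.3] = 1.0

pathways_df = pd.DataFrame(synthetic)
cluster_input = pathways_df.values
```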

With the data available, we can begin clustering!

Here, I take advantage of scikit-learn’s sklearn.cluster module, which has both hierarchical and K-Means clustering capabilities.

For the purposes of this demonstration, I am going to only show the process of producing the hierarchical clusters… the curious reader may choose to explore alternative clustering methods available in the scikit-learn toolbox.

### Perform hierarchical clustering with 3 clusters ###

# Import the necessary tools
from sklearn.cluster import AgglomerativeClustering

# Begin by assuming 3 clusters
num_clusters = 3

# Initialize the hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters = num_clusters)

# Cluster the data, producing an array which specifies the cluster of each pathway
hierarchical_labels = hierarchical.fit_predict(cluster_input)

In order to interpret the significance of these clusters, we want to find a way to visualize differences in cluster behavior through the infrastructure pathways.

To do this, I am going to consider the median timing of each infrastructure option within each cluster, across all 1,000 realizations. Additionally, I want to sort the clusters according to some qualitative characteristic. In this case, I define pathways with early infrastructure investments as “heavy”, and pathways with infrastructure investment later in the period as “light”. The “moderate” infrastructure pathways are contained within the cluster between these two.

The following script calculates the median of the clustered pathways, and sorts them according to their infrastructure timing.

### Interpreting clusters via pathways ###

# Group realizations by pathway
# Calculate the median week each infra. opt. is constructed in each cluster
import numpy as np
cluster_pathways = []
cluster_medians = []

# Loop through clusters
for i in range(0, num_clusters):

    # Select realizations contained within i-th cluster
    current_cluster = cluster_input[hierarchical_labels == i, :]

    # Multiply by 2344 weeks to convert from [0, 1] to [0, 2344] weeks
    current_cluster = current_cluster * 2344

    cluster_pathways.append(current_cluster)

    # Find the median build week of each infrastructure option
    current_medians = np.zeros(len(current_cluster[0, :]))

    # Loop through infrastructure options and calculate medians
    for j in range(0, len(current_cluster[0, :])):
        current_medians[j] = np.median(current_cluster[:, j])

    cluster_medians.append(current_medians)

# Sort clusters by average of median build week 
# to get light, moderate, and heavy infrastructure clusters
cluster_means = np.mean(cluster_medians, axis = 1)

sorted_indices = np.argsort(cluster_means)

# Re-order cluster medians according to sorted mean build week
cluster_medians = np.vstack((cluster_medians[sorted_indices[2]], 
                            cluster_medians[sorted_indices[1]],
                            cluster_medians[sorted_indices[0]]))

With the median timing of each infrastructure option specified for each cluster, I can begin the process of visualizing the behavior. The function below plots a single infrastructure pathway:

import numpy as np
import matplotlib.pyplot as plt

def plot_single_pathway(cluster_medians, cluster_pathways, inf_options_idx,
                        c, inf_names, y_offset, ylabeling, xlabeling, 
                        plot_legend, ax):
    
    """
    Makes a plot of an infrastructure pathway.
    
    PARAMETERS:
        cluster_medians: an array with median weeks each option is built
        cluster_pathways: an array with every pathway in the cluster
        inf_options: an array with numbers representing each option (y-axis vals)
        should start at zero to represent the "baseline"
        c: color to plot pathway
        y_offset: distance to offset each pathway from 0 (easier to visualize stacked paths)
        ylabeling: labeling for infrastructure options
        xlabeling: labeling x axis
        plot_legend: (bool) to turn on or off
        ax: axes object to plot pathway
    """
    
    # Extract all non-baseline infrastructure options
    inf_options_idx_no_baseline=inf_options_idx[1:]
    sorted_inf = np.argsort(cluster_medians)
    
    # Plot Pathways
    # create arrays to plot the pathway lines. To ensure pathways have corners 
    # we need an array to have length 2*num_inf_options
    pathway_x = np.zeros(len(cluster_medians)*2+2)
    pathway_y = np.zeros(len(cluster_medians)*2+2)
    
    # To have corners, each inf. opt. must be located
    # in both the cell it is triggered, and adjacent cell
    
    cluster_medians = np.rint(cluster_medians/45)
    for i in range(0,len(cluster_medians)):
        for j in [1,2]:
            pathway_x[i*2+j] = cluster_medians[sorted_inf[i]]
            pathway_y[i*2+j+1] = inf_options_idx_no_baseline[sorted_inf[i]]

    # Set the end of the pathway at year 45
    pathway_x[-1] = 45

    # plot the pathway line    
    ax.plot(pathway_x, pathway_y + y_offset, color=c, linewidth=5, 
            alpha = .9, zorder=1)

    # Formatting plot elements 
    ax.set_xlim([0,44])
    inf_name_lables = inf_names
    inf_options_idx = np.arange(len(inf_name_lables))
    ax.set_yticks(inf_options_idx)
    ax.set_xlabel('Week')
    
    if ylabeling:
        ax.set_yticklabels(inf_name_lables)
    else:
        ax.set_yticklabels([])
    
    ax.set_ylim([-0.5,len(cluster_medians)+.5])

    plt.show()

The above script can be used to plot the median pathway for a single cluster, but it would be much more worthwhile to compare the median pathways of the different clusters against one another. To do this, we can define another function which will use the plot_single_pathway() iteratively to plot each of the median cluster pathways on the same figure.

import numpy as np
import matplotlib.pyplot as plt
from plot_single_pathway import plot_single_pathway

def plot_many_paths(clust_meds, clust_pathways, n_clusters, cluster_colors, fig, 
                    gspec, fig_col, ylabeling, plot_legend):
    
    y_offsets = [-.15, 0, .15]
    
    # Specific information for the utility
    inf_options_idx = np.array([0, 1, 2, 3, 4])
    utility_inf = ['Baseline', 'Small intake expans.', 'Large intake expans.',
                   'Small WTP expans.', 
                   'Large WTP expans.']
    
    ax = fig.add_subplot(gspec[0, 0])
    
    for i in np.arange(n_clusters):
        plot_single_pathway(clust_meds[i], clust_pathways[i], inf_options_idx,
                            cluster_colors[i], utility_inf, y_offsets[i],
                            ylabeling, False, True, ax)
    
    if plot_legend and (n_clusters == 3):
        ax.legend(['Light inf.', 'Moderate inf.', 'Heavy inf.'], 
                  loc='upper left')
    elif plot_legend and (n_clusters == 2):
        ax.legend(['Light inf.', 'Heavy inf.'], loc='upper left')
        
    ax.tick_params(axis="y", which="both", left=False, right=False)
    
    plt.show()
    return fig

Time for the reveal! Execution of this final portion of the script will plot each of the three clusters, and color-code them according to classifications as “heavy”, “moderate”, or “light” infrastructure.

### Plotting ###
from plot_many_paths import plot_many_paths

fig = plt.figure(figsize = (5,5))
gspec = fig.add_gridspec(ncols=1, nrows=1)

# Colorscale
heavy_inf_color = '#132a13'
mid_inf_color = '#4f772d'
light_inf_color = '#90a955'
cluster_colors = [light_inf_color, mid_inf_color, heavy_inf_color]

plot_many_paths(cluster_medians, cluster_pathways, 3, cluster_colors, 
                fig, gspec, 0, True, True)

Figure: Median infrastructure pathways for each cluster.

Voilà!

The figure above shows a clear contrast in the behavior of the light and heavy infrastructure pathways; the heavy infrastructure pathway is not only characterized by early infrastructure construction, but also by more frequent construction throughout the simulation period, in comparison to the median of the light infrastructure cluster which only requires a single infrastructure expansion within the same simulation period.

From the plot above, it is not apparent that three clusters are necessarily appropriate for this data set. Although the heavy and light infrastructure clusters are clearly demarcated from one another, the moderate infrastructure cluster has characteristics of both: a late first-expansion similar to the light infrastructure cluster, but a high frequency of construction throughout the period similar to the heavy infrastructure. Should the moderate infrastructure really be a separate cluster?

Let’s take advantage of the knowledge we developed earlier, and consider the silhouette scores corresponding to different numbers of clusters! Fortunately, scikit-learn has a suite of tools that make silhouette analysis very easy.

In the following script, I evaluate the silhouette score associated with clustering for 2, 3, and 4 different clusters.

### Performing silhouette analysis ###
from sklearn.metrics import silhouette_score

num_clust = [2, 3, 4]

# Set up hierarchical clustering with different numbers of clusters
ac2 = AgglomerativeClustering(n_clusters = num_clust[0])
ac3 = AgglomerativeClustering(n_clusters = num_clust[1])
ac4 = AgglomerativeClustering(n_clusters = num_clust[2])

# Extract the silhouette score for each hierarchical grouping
silhouette_scores = []
silhouette_scores.append(silhouette_score(cluster_input, ac2.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac3.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac4.fit_predict(cluster_input)))
                                                  
plt.bar(num_clust, silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 14)
plt.ylabel('Silhouette Score', fontsize = 14)
plt.xticks(num_clust)
plt.title("Silhouette analysis")
plt.show()

Figure: Silhouette analysis for the hierarchical clustering of infrastructure pathways.

Interestingly, the silhouette score was highest when the clustering was performed using only two clusters, suggesting that the moderate infrastructure class may not have been sufficiently unique to warrant a separate cluster.

Another method of visualizing the cluster structure is through the use of dendrograms. Below, I show dendrograms color-coded with only two clusters (left) and using three clusters (right).

Figure: Dendrograms showing the cluster structure of infrastructure pathways for a utility, using 2 (left) and 3 (right) clusters.

The dendrogram containing only two clusters (left) reveals that the moderate and light clusters can be combined into a single cluster, because the moderate cluster is more similar to the light infrastructure cluster than it is to the heavy infrastructure cluster. However, it is important to note that dendrograms are not, on their own, a reliable method of determining the number of clusters. In fact, given that the difference in silhouette scores between the 2-cluster and 3-cluster groupings is relatively modest (~0.05), there is not sufficiently strong justification to choose 2 clusters over 3 clusters.
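
For reference, color-coded dendrograms like those above can be sketched with scipy's color_threshold argument; this stand-in uses random placeholder data in place of the actual pathway matrix:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(2)
cluster_input = rng.uniform(size=(100, 4))  # placeholder for the pathway data

# Build the hierarchical cluster structure (Ward linkage, illustratively)
Z = linkage(cluster_input, method='ward')

fig, (ax2, ax3) = plt.subplots(1, 2, figsize=(10, 4))

# color_threshold sets the height at which the tree is "cut": links below
# the threshold are colored by cluster. Cutting just below the final merge
# yields 2 colored clusters; just below the second-to-last merge yields 3.
dendrogram(Z, color_threshold=Z[-1, 2], ax=ax2, no_labels=True)
dendrogram(Z, color_threshold=Z[-2, 2], ax=ax3, no_labels=True)
plt.show()
```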

For the sake of comparison, here are the 2-cluster and 3-cluster median pathway plots:

Conclusion

I hope you’ve gained some insight into clustering, infrastructure pathways, or (preferably) both in this post.

Analysis of infrastructure pathways via statistical clustering presents a unique application of a very common statistical tool. In this analysis I clustered the pathways with respect to the timing of the infrastructure expansions; however, there is potential for a much more in-depth analysis of how the hydrologic or urban contexts of each realization shape these infrastructure pathways… possibly justifying another blog post at a later date. Stay tuned!

Thanks again to David Gold for providing the data and the foundational Python code.

References

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research, 52(9), 7327–7346.
