Visualizing High Dimensional Data Using the T-SNE Method

In this blog post, I am going to go over a machine learning method that is used to visualize high dimensional datasets. The method is called T-distributed Stochastic Neighbor Embedding, or T-SNE for short, and it was developed by Laurens van der Maaten and Geoffrey Hinton (van der Maaten & Hinton, 2008). Below, I describe T-SNE and provide a simple example in R.

Two weeks ago, Dave introduced a few useful dimensionality reduction methods to visualize the Pareto front. Last week, Rohini’s blog post provided some very useful information about self-organizing maps. I guess this blog post will officially make this a series of posts on dimensionality reduction and the visualization of high dimensional datasets. If you haven’t already read these two super informative blog posts, I highly recommend you take a look at them.

T-SNE is an extension of the Stochastic Neighbor Embedding (SNE) technique, and it uses an interesting approach to maintain the local relationships and similarities that exist in the original high-dimensional dataset when projecting it onto a 2-D plane. As van der Maaten and Hinton (2008) argue, other dimensionality reduction techniques (e.g., PCA) usually focus on macro-scale distances in the dataset, which can lead to overlooking short-distance relationships. T-SNE, however, is able to capture both local and macro-scale structure in the data, so it tends to give more reasonable results when you are dealing with non-linear high dimensional datasets.

T-SNE uses the following formula, which basically places a Gaussian distribution over each data point and computes the density of all other data points with respect to that distribution. The denominator of the equation normalizes the numerator.
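
In the notation of van der Maaten and Hinton (2008), the similarity of point x_j to point x_i is

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

and these conditional probabilities are symmetrized over the n points, p_{ij} = (p_{j|i} + p_{i|j}) / (2n), to obtain the joint similarities used in the rest of the algorithm.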

In this equation, x_i is the reference point and x_j is one of its neighboring points. Sigma_i is a point-specific bandwidth chosen so that every point ends up with the same perplexity. Perplexity is a measure of uncertainty that has a direct relationship with entropy; for more information about it, you can read this Wikipedia page. Basically, perplexity is a hyperparameter of T-SNE, and the final outcome can be very sensitive to its value. Also, in this equation, p_ij is the probability of similarity of two points in the high dimensional space: if two points are close to each other, p_ij is high, and if they are far from each other, the probability of similarity is low.

To make the low dimensional map consistent with the original high dimensional dataset, T-SNE also calculates a similar point-to-point probability of similarity, q_ij, in the 2-D map.
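
Using the mapped points y_i and y_j, this low dimensional similarity is defined (van der Maaten & Hinton, 2008) as

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}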

Basically, T-SNE moves the points around the 2-D plane to find the locations that minimize the Kullback-Leibler divergence between the distribution P of similarities in the original data and the distribution Q of similarities in the 2-D map.
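
Written out, the cost function is

KL(P \| Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}

which T-SNE minimizes by gradient descent.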

As you probably noticed, in the low dimensional space T-SNE uses the Student t-distribution (with one degree of freedom). The reason is that the Student t is a fat-tailed distribution compared to the Gaussian. This allows T-SNE to separate dissimilar points with larger margins, which makes the map easier to interpret.

T-SNE Example in R

Here, I provide a simple R code example that demonstrates how you can use T-SNE to visualize high dimensional datasets. This example only has four dimensions, so it doesn’t really represent a high dimensional problem; keep in mind that T-SNE shows its true capabilities when there are tens or even hundreds of dimensions.

First, you need to install the "Rtsne" package and load it.

install.packages("Rtsne”)
library(Rtsne)

# We also need to load ggplot2 and ggpubr libraries

library(ggplot2)
library(ggpubr)

In this example, I use R’s famous “iris” data set. Here is how you can activate it.

# Load the iris dataset
data(iris)

Now you can use the following script to visualize the data using T-SNE.

# Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=30, max_iter=1000, check_duplicates = FALSE) # Run TSNE

# Create a data frame using IRIS values and the T-SNE values
df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))

# Specify column names
names(df_to_gg)<-c("Species", "Y.1", "Y.2")

# Show the objects in the 2D T-SNE representation
ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
  labs(title=paste("T-SNE Visualization- Perplexity Number = 30" )) +
  scale_color_manual(values = c("darkblue", "orange", "pink3"))

Here is the 2-D map that our code generates:

Although T-SNE is a powerful method and has become quite popular in recent years, it has some pitfalls. This blog post discusses some of these issues, including the sensitivity of the final map to the perplexity hyperparameter, and the fact that T-SNE maps can be tricky to interpret because they might not represent the actual relationships between different clusters.

To explore the impacts of the perplexity hyperparameter on the final clusters, I use the following code to create T-SNE maps for different perplexity values.

# Here are the perplexity values that I am taking into account in my T-SNE analysis
perplexity_number_values<-c(2,5,10, 15, 30, 40)

# Initializing a plot list object
plot_list = list()

# This loop explores the effect of perplexity values on the T-SNE results
for (i_prp in 1:6){
  
  # I am setting a seed to make sure that my results for different perplexities are not sensitive to any random factors
  set.seed(10)
  
  # Perplexity number
  perplexity_number<- perplexity_number_values[i_prp]
  
  # Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
  y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=perplexity_number, max_iter=1000, check_duplicates = FALSE) # Run TSNE
  
  # Create a data frame using IRIS values and the T-SNE values
  df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))
  
  # Specify Column names
  names(df_to_gg)<-c("Species", "Y.1", "Y.2")
  
  # Show the objects in the 2D T-SNE representation
  plt<-ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
    labs(title=paste("T-SNE Visualization- Perplexity Number = ", perplexity_number )) +
    scale_color_manual(values = c("darkblue", "orange", "pink3"))
  
  # Save figure in the plot list object
  plot_list[[i_prp]]=plt
  
  
}
# Combine all the plot objects
plt_combined<-ggarrange(plotlist = plot_list[1:6], ncol = 2, nrow = 3, common.legend = T)

Here is the final result, which underlines how sensitive the location of our clusters on the 2-D map is to the choice of the perplexity value. There are other hyperparameters that can be tuned as well; for example, the maximum number of iterations might change the final output in some cases.

There are many other very good tutorials on T-SNE, such as here and here. Also, for example, this tutorial provides a nice and easy-to-follow Python code for T-SNE. Finally, there are some great YouTube videos that clearly explain the logic behind T-SNE and its algorithm (for example, here and here).

Using the Python Borg Wrapper – Lake Problem Example

Python compatibility with the Borg MOEA is highly useful for practical development and optimization of Python models. The Borg algorithm is implemented in C, a strongly-typed, compiled language that offers high efficiency and scalability for complex optimization problems. Often, however, our models are not written in C/C++, but rather in simpler scripting languages like Python, MATLAB, and R. The Borg Python wrapper allows problems written in Python to be optimized by the underlying C algorithm, maintaining efficiency and ease of use. Use of the Python wrapper varies slightly across operating systems and computing architectures. This post will focus on Linux systems, like “The Cube”, our computing cluster here at Cornell, and will work with the most current implementation of the Borg MOEA, which can be accessed with permission from this site.

The underlying communications between a Python problem and the Borg C source code are handled by the wrapper with ctypes, a library that provides Python compatibility with C types and functions in shared libraries. Shared libraries (conventionally .so files in Linux/Unix) provide dynamic linking, a systems tool that allows for compiled code to be linked and reused with different objects. For our purposes, we can think of the Borg shared library as a way to compile the C algorithm and reuse it with different Python optimization runs, without having to re-compile any code. The shared library gives the wrapper access to the underlying Borg functions needed for optimization, so we need to create this file first. In the directory with the Borg source code, use the following command to create the (serial) Borg shared library.

gcc -shared -fPIC -O3 -o libborg.so borg.c mt19937ar.c -lm
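
Under the hood, borg.py loads this file with ctypes. A minimal sketch of that step (illustrative only; the wrapper’s actual loading code also declares the argument and return types of the C functions it calls):

from ctypes import CDLL

# Load the compiled Borg shared library from the current directory
libborg = CDLL("./libborg.so")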

Next, we need to move our shared library into the directory containing the Python wrapper (borg.py) and whatever problem we are optimizing.

In this post, we’ll be using the Lake Problem DPS formulation to demonstrate the wrapper. Here’s the source code for the problem:

"""
@author: Rohini

#DPS Formulation 

#Objectives:
#1) Maximize expected economic benefit
#2) Minimize worst case average P concentration 
#3) Maximize average inertia of P control policy
#4) Maximize average reliability 

#Constraints: 
#Reliability has to be >85%

#Decision Variables 
#vars: vector of size 3n 
#n is the number of radial basis functions needed to define the policy
#Each has a weight, center, and radius (w1, c1, r1, ..., wn, cn, rn)

#Time Horizon for Planning, T: 100 Years
#Simulations, N: 100 

"""

import numpy as np
from sys import *
from math import *
from scipy.optimize import root
import scipy.stats as ss

# Lake Parameters
b = 0.42
q = 2.0

# Natural Inflow Parameters
mu = 0.03
sigma = np.sqrt(10**-5)

# Economic Benefit Parameters
alpha = 0.4
delta = 0.98

# Set the number of RBFs (n), decision variables, objectives and constraints
n = 2
nvars = 3 * n
nobjs = 4
nYears = 100
nSamples = 100
nSeeds = 2
nconstrs = 1

# Set Thresholds
reliability_threshold = 0.85
inertia_threshold = -0.02


###### RBF Policy ######

#Define the RBF Policy
def RBFpolicy(lake_state, C, R, W):
    # Determine pollution emission decision, Y
    Y = 0
    for i in range(len(C)):
        if R[i] != 0:
            Y = Y + W[i] * ((np.absolute(lake_state - C[i]) / R[i])**3)

    Y = min(0.1, max(Y, 0.01))

    return Y


###### Main Lake Problem Model ######

def LakeProblemDPS(*vars):

    seed = 1234

    #Solve for the critical phosphorus level
    def pCrit(x):
        return [(x[0]**q) / (1 + x[0]**q) - b * x[0]]

    sol = root(pCrit, 0.5)
    critical_threshold = sol.x

    #Initialize arrays
    average_annual_P = np.zeros([nYears])
    discounted_benefit = np.zeros([nSamples])
    yrs_inertia_met = np.zeros([nSamples])
    yrs_Pcrit_met = np.zeros([nSamples])
    lake_state = np.zeros([nYears + 1])
    objs = [0.0] * nobjs
    constrs = [0.0] * nconstrs

    #Generate nSamples of nYears of natural phosphorus inflows
    natFlow = np.zeros([nSamples, nYears])
    for i in range(nSamples):
        np.random.seed(seed + i)
        natFlow[i, :] = np.random.lognormal(
            mean=log(mu**2 / np.sqrt(mu**2 + sigma**2)),
            sigma=np.sqrt(log((sigma**2 + mu**2) / mu**2)),
            size=nYears)

    # Determine centers, radii and weights of RBFs
    C = vars[0::3]
    R = vars[1::3]
    W = vars[2::3]
    newW = np.zeros(len(W))

    #Normalize weights to sum to 1
    total = sum(W)
    if total != 0.0:
        for i in range(len(W)):
            newW[i] = W[i] / total
    else:
        for i in range(len(W)):
            newW[i] = 1 / n

    #Run model simulation
    for s in range(nSamples):
        lake_state[0] = 0
        Y = np.zeros([nYears])

        #find policy-derived emission

        Y[0] = RBFpolicy(lake_state[0], C, R, newW)

        for i in range(nYears):
            lake_state[i + 1] = lake_state[i] * (1 - b) + (
                lake_state[i]**q) / (1 +
                                     (lake_state[i]**q)) + Y[i] + natFlow[s, i]
            average_annual_P[
                i] = average_annual_P[i] + lake_state[i + 1] / nSamples
            discounted_benefit[
                s] = discounted_benefit[s] + alpha * Y[i] * delta**i

            if i >= 1 and ((Y[i] - Y[i - 1]) > inertia_threshold):
                yrs_inertia_met[s] = yrs_inertia_met[s] + 1

            if lake_state[i + 1] < critical_threshold:
                yrs_Pcrit_met[s] = yrs_Pcrit_met[s] + 1

            if i < (nYears - 1):
                #find policy-derived emission
                Y[i + 1] = RBFpolicy(lake_state[i + 1], C, R, newW)

    # Calculate minimization objectives (defined in comments at beginning of file)
    objs[0] = -1 * np.mean(discounted_benefit)  #average economic benefit
    objs[1] = np.max(
        average_annual_P)  #minimize the max average annual P concentration
    objs[2] = -1 * np.sum(yrs_inertia_met) / (
        (nYears - 1) * nSamples
    )  #average percent of transitions meeting inertia thershold
    objs[3] = -1 * np.sum(yrs_Pcrit_met) / (nYears * nSamples
                                            )  #average reliability

    constrs[0] = max(0.0, reliability_threshold - (-1 * objs[3]))

    return (objs, constrs)

The important function for this blog post is LakeProblemDPS, which demonstrates how to configure your own problem with the wrapper. Your function must take in *vars, the decision variables, and return objs, a list of objective values (or a tuple of objective values and constraints). Within the problem, refer to vars[i] as the i-th decision variable, for i in [0,nvars-1]. Set the list of objective values in the same manner.
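
As a minimal illustration of that interface, here is a hypothetical two-variable, two-objective problem with one constraint (a sketch of the required format, not part of the Borg distribution):

# Hypothetical problem in the format the wrapper expects:
# take in *vars and return a list of objectives,
# or a tuple of (objectives, constraints)
def exampleProblem(*vars):
    objs = [vars[0]**2 + vars[1]**2, (vars[0] - 1.0)**2 + vars[1]**2]
    constrs = [max(0.0, vars[0] + vars[1] - 1.5)]  # nonzero when the constraint is violated
    return (objs, constrs)

An instance of Borg for this toy problem would then be created with Borg(2, 2, 1, exampleProblem), following the same pattern as the Lake problem example below.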

Once our problem is defined and compatible with the wrapper, we can optimize with Borg. The following code runs the Lake problem optimization for 10,000 function evaluations.

# Serial Borg run with Python wrapper
# ensure libborg.so is compiled and in this directory
from borg import *
from lake import *

maxevals = 10000

# create an instance of the serial Borg MOEA
borg = Borg(nvars, nobjs, nconstrs, LakeProblemDPS)

# set the decision variable bounds and objective epsilons
borg.setBounds(*[[-2, 2], [0, 2], [0, 1]] * int((nvars / 3)))
borg.setEpsilons(0.01, 0.01, 0.0001, 0.0001)

# perform the optimization
# pass in a dictionary of arguments, as defined in borg.py
result = borg.solve({"maxEvaluations": maxevals})

# print the resulting objectives
for sol in result:
    print(sol.getObjectives())

Note the constructor Borg() creates an instance of the Borg algorithm with a specified number of variables, objectives, and constraints. The LakeProblemDPS argument is the objective function to be optimized by this instance of Borg. The setBounds and setEpsilons methods are required. solve() performs the optimization and takes in a dictionary of Borg parameters. See borg.py for a comprehensive list.

Using the Python wrapper to run the Parallel Borg MOEA

The previous example uses the serial Borg MOEA, but the wrapper also supports the master-worker and multi-master parallelizations. Configuring a parallel version of the algorithm requires a few additional steps. First, you must compile a shared library of the parallel implementation and move it to the wrapper directory.

For the master-worker version, use:

mpicc -shared -fPIC -O3 -o libborgms.so borgms.c mt19937ar.c -lm

For the multi-master version, use:

mpicc -shared -fPIC -O3 -o libborgmm.so borgmm.c mt19937ar.c -lm

To call the master-worker version, you must explicitly start up and shut down MPI using the Configuration class provided in borg.py. The following code performs a parallel master-worker optimization of the Lake problem:

# Master-worker Borg run with Python wrapper
# ensure libborgms.so is compiled and in this directory
from borg import *
from lake import *

# set max time in hours
maxtime = 0.1

# need to start up MPI first
Configuration.startMPI()

# create an instance of Borg with the Lake problem
borg = Borg(nvars, nobjs, nconstrs, LakeProblemDPS)

# set bounds and epsilons for the Lake problem
borg.setBounds(*[[-2, 2], [0, 2], [0, 1]] * int((nvars / 3)))
borg.setEpsilons(0.01, 0.01, 0.0001, 0.0001)

# perform the optimization
result = borg.solveMPI(maxTime=maxtime)

# shut down MPI
Configuration.stopMPI()

# only the master node returns a result
# print the objectives to output
if result:
    for solution in result:
        print(solution.getObjectives())

This script must be called as a parallel process – here’s a SLURM submission script that can be used to run the optimization on 16 processors (compatible with The Cube):

#!/bin/bash
#SBATCH -J py-wrapper
#SBATCH -o normal.out
#SBATCH -e normal.err
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 16

mpirun python3 mslake.py

sbatch submission.sbatch will allocate one node with 16 processors for the optimization run.

Troubleshooting

Depending on your machine’s MPI version and your shell’s LD_LIBRARY_PATH environment variable, the Borg wrapper may try to access an unavailable mpi shared library. This issue happens on our cluster, the Cube, and causes the following error:

OSError: libmpi.so.0: cannot open shared object file: No such file or directory

In borg.py, the startMPI method attempts to access the nonexistent libmpi.so.0 shared library. To fix this, find the location of your mpi files with:

echo $LD_LIBRARY_PATH

Most likely, the path to your mpi library (e.g., /opt/ohpc/pub/mpi/openmpi3-gnu8/3.1.4/lib on the Cube) will print. (Note: if such a path does not exist, set the LD_LIBRARY_PATH environment variable to include your mpi library.) Navigate to this directory and view the file names. On the Cube, libmpi.so.0 (the library the Borg wrapper is trying to access) does not exist, but libmpi.so does (this is a software versioning discrepancy). Back in the startMPI method in borg.py, change the line

CDLL("libmpi.so.0", RTLD_GLOBAL)

to access the existing mpi library. On the Cube:

CDLL("libmpi.so", RTLD_GLOBAL)

Deeper Dive into Self-Organizing Maps (SOMs)

This blog post focuses on Self-Organizing Maps (SOMs) or Kohonen maps, first created by Teuvo Kohonen in the 1980s, and most recently introduced in Dave Gold’s latest blog post. SOMs are a type of artificial neural network (ANN) that are trained using unsupervised learning to produce a two-dimensional pattern representation of higher-dimensional datasets. SOMs are therefore a good algorithm for dimension reduction, and inputs that are closer in their higher-dimensional input vectors are also mapped closely together in the resulting 2D representation [1].

User-defined neuron grid that represents the basis of SOM  [1]

SOMs can be very effective for clustering/classifying outputs and are advantageous because they “self-organize” to identify clusters. The SOM is first initialized with a user-defined 2D grid of neurons rather than the layers of a traditional ANN. The weight vectors that characterize the neurons can be initialized randomly or, more favorably, by sampling across the eigenvectors of the two largest principal components of the input dataset. The training process that SOMs use to build maps is called “competitive learning”, which is also fundamentally different from the backpropagation used for ANNs. During competitive learning, an input training point is fed into the algorithm. Its Euclidean distance from each of the neurons is calculated, and the closest neuron is termed the best matching unit, or BMU. The weights of that neuron and the surrounding neurons are adjusted to be closer to the input vector, using an update formula. This process is repeated for every input in the training data and for a large number of iterations until patterns and clusters begin to emerge [3]. The animation below illustrates competitive learning. Note how the locations of neurons are adjusted as training points get assigned to their respective BMUs and how the whole grid shifts in response to the new samples.

220px-TrainSOM
Competitive learning [2]
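
For completeness, the weight update applied at each step of competitive learning is usually written as

w_v(t+1) = w_v(t) + \theta(u, v, t) \, \alpha(t) \, (x(t) - w_v(t))

where w_v is the weight vector of neuron v, x(t) is the current training input, u is its BMU, \alpha(t) is a learning rate that decays over the iterations, and \theta(u, v, t) is a neighborhood function that shrinks the influence of the update on neurons far from the BMU as training progresses.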

At a high level, this is how SOMs are trained. Next, we will go through a clustering example using the kohonen package offered in R.

Test Case: Maryland Trout

For this test case, we will be using a presence/absence dataset of Maryland trout. Each data point is characterized by a feature vector that includes information on agricultural land cover, logarithmic distance to the nearest road, riffle quality, dissolved oxygen, and spring water temperature. The label associated with each point is a 0 or 1, denoting whether trout were observed. We will train a SOM to cluster the feature vectors and then investigate the characteristics of the resulting clusters. Below is a snippet of our dataset. We use columns 3-7 as our feature vector, while column 2 denotes our label. SOMs can also be used for classification, which is not discussed in this blog post; in that case, you would train the SOM on only a subset of the dataset and use the trained SOM to predict the clusters that the testing points are assigned to.

inputset

First we must scale our data to make sure that variables in our feature vector that have different magnitudes are treated equally by the algorithm. Then we define the SOM grid. The choice of the number of neurons generally follows the rule of thumb 5*sqrt(# of datapoints). In our case, we have 84 datapoints, so the rule suggests roughly 46 neurons, or a grid of about 7×7 or smaller. I decided to use 5×5, for reasons explained in the next paragraph. I also specify that I want the grid to be in a hexagonal shape. Finally, we train our model by feeding in our input data and the user-defined grid, and by setting rlen = 1000, which means that the input data will be presented to the algorithm 1000 times.

require(kohonen)
data=read.csv("Maryland.Trout.csv")
#Create the training dataset
data_train <- data[, c(3:7)]
# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process.
data_train_matrix <- as.matrix(scale(data_train))
# Create the SOM Grid
som_grid <- somgrid(xdim = 5, ydim=5, topo="hexagonal")
# Finally, train the SOM
som_model <- som(data_train_matrix,
                 grid=som_grid, rlen = 1000,
                 keep.data = TRUE )

Once the SOM is trained, there are a variety of figures that can be easily created. We can plot the training progress and observe how the distance to the BMU decreases as a function of iteration and starts to level off as we reach many iterations. This is ideal behavior because it means that the training points are associating closely with their BMU.

plot(som_model, type="changes")
Training
Average distance from a test point to its BMU

We can also plot a heat map of the node counts, which shows how many of the training inputs are associated with each node. Any node that is grey means that no training points were assigned to that BMU. This may indicate that the map is too big, so you can go back and adjust the grid size accordingly until there are no grey nodes; for this dataset, that happened at around 5×5. Furthermore, we want the number of points associated with each node to be as uniform as possible.

plot(som_model, type="count", main="Node Counts")
Nodes

Next we can view the neighbor distances for each neuron, which is also called the U-matrix. Small distances indicate similarity between neighboring nodes.

plot(som_model, type="dist.neighbours", main = "SOM neighbor distances")
neighbor

We can also view the weight vectors of the nodes, which are termed codes. The fan in each node indicates the variables of prominence that link the datapoints assigned to that neuron. In our dataset, you can see that in the upper left corner of the map, the algorithm is assigning datapoints to these neurons because of similarities in the agricultural area and water temperature variables of the feature vector. In the lower right corner, the datapoints are linked primarily due to similarities in riffle quality and dissolved oxygen. This type of map is useful for understanding which features are actually driving the clustering.

plot(som_model, type="codes")
codes

In my opinion, one of the most useful standard figures that the kohonen library creates is the heatmap that shows the gradient of input values across the nodes. A heatmap can be created for each feature, and when placed side by side they help demonstrate correlations between inputs. We can clearly see that areas characterized by low agricultural land tend to have better riffle quality, lower water temperature, and higher dissolved oxygen. While these patterns are expected and have a strong physical explanation, these maps can also help to discover other, unexpected patterns. Note that the values on the side bar are normalized and can be transformed back to the original scale.

 

# The magma palette used below comes from the viridis package
library(viridis)

#Plot heatmaps for each feature
plot(som_model, type = "property", property = getCodes(som_model)[,1], main=colnames(getCodes(som_model))[1],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,2], main=colnames(getCodes(som_model))[2],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,3], main=colnames(getCodes(som_model))[3],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,4], main=colnames(getCodes(som_model))[4],palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,5], main=colnames(getCodes(som_model))[5],palette.name=magma)
#Unscale the variables- do this for each input column in data_train
var_unscaled <- aggregate(as.numeric(data_train[,1]), by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)[,2]
plot(som_model, type = "property", property=var_unscaled, main=colnames(getCodes(som_model))[1], palette.name=magma)
heatmaps

Finally, we can cluster our neuron codes (weight vectors) to add some boundaries to our map. Typically, k-means clustering is used to determine an appropriate number of clusters that, in this case, minimizes the within-cluster sum of squares (WCSS). We can then plot the results and look for an elbow in the scree plot.

 

#k-means clustering code from [4]
mydata = som_model$codes[[1]]
wss = (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) {
  wss[i] = sum(kmeans(mydata, centers=i)$withinss)
}
par(mar=c(5.1,4.1,4.1,2.1))
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main="Within cluster sum of squares (WCSS)")
WCCS

The elbow isn’t so clear in this dataset, but I go with 8 clusters and then use hierarchical clustering to assign each neuron to a cluster. Finally, we can obtain the cluster that each point is assigned to and insert that back as a column in our dataset.

#Final clustering code from [4]
#Define a palette and cluster with the number of clusters determined using k-means
# (an eighth color is added so that each of the 8 clusters gets its own shade)
pretty_palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f')
som_cluster <- cutree(hclust(dist(som_model$codes[[1]])), 8)
# plot these results:
plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(som_model, som_cluster)

# get vector with cluster value for each original data sample
cluster_assignment <- som_cluster[som_model$unit.classif]
# for further analysis, add the assignment as a column in the original data:
data$cluster <- cluster_assignment

clustering

Post-Processing

We’re now interested in understanding which of our points actually clustered together, so we generate summary statistics for each cluster. The values highlighted in blue represent the best values for each feature, while those in red represent the worst. Note that the clusters with a high average trout presence are characterized by the best values of the features that are favorable for aquatic life, while the clusters that exhibit a lower presence of trout do not have good values for these features. You can also see that clusters with similar characteristics are side by side on the clustering map.

summary

Overall, I found the kohonen library to be a really useful tool to start to internalize how SOMs work and to fairly quickly be able to do a nice analysis on both the features and the resulting clusters. I would highly recommend this package as a starting point. There are packages in Python such as SimpSom and minisom that have snappier graphics and will be a good next step to explore.

References

[1] https://towardsdatascience.com/self-organizing-maps-ff5853a118d4

[2] By Chompinha – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=77822988

[3] http://fourier.eng.hmc.edu/e161/lectures/som/node2.html

[4] Much direction for structuring this tutorial comes from https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/

Lower dimensional visualizations of many-objective Pareto Fronts

Understanding the geometry of a many-objective Pareto front can be challenging due to the high dimensionality of many-objective problems. Often, we use tools such as parallel coordinate plots, glyph plots and pairwise scatter plot matrices to identify trade-offs and select candidate alternatives. While these tools are useful for decision making, they don’t always capture patterns or geometric properties that may be embedded within many-objective Pareto fronts.

Mathematical tools for visualizing high dimensional data on lower dimensional manifolds have existed for decades, and in recent years they have been applied in many-objective contexts (Filipič and Tušar, 2018). In this post I’ll overview four common methods for representing Pareto fronts on lower dimensional manifolds and demonstrate them on two many-objective test problems: the four objective DTLZ 2 and four objective DTLZ 7 (Deb et al., 2005).

Parallel coordinate plots of the two test problems can be found in Figure 1. DTLZ 2 has a continuous Pareto front, while DTLZ 7 has a highly disconnected Pareto front. Both Pareto fronts visualized here are the analytical true Pareto fronts from the MOEAFramework.

I’ve added the code to produce the plots in this post to my repository on many-objective visualization, which can be found here.

Figure 1: Parallel coordinate plots of the four objective DTLZ 2 (left) and DTLZ 7 (right)

1. Multidimensional Scaling

Multidimensional Scaling (MDS) refers to a family of techniques that seek low dimensional representations of high dimensional spaces by preserving pairwise distances between points (Kruskal, 1978). Classic MDS attempts to preserve the euclidean distance between each pair of points by minimizing a stress function defined as:

stress = (\frac{\sum_i \sum_j (f(x_{ij}) - d_{ij})^2}{\sum_i \sum_j d_{ij}^2})^{1/2}

Where:

f(x_{ij}) is the euclidean distance between points x_i and x_j in the full dimensional space. (Note: extensions of MDS have been developed that substitute this distance for a weakly monotonic transformation of the original points)

d_{ij} is the euclidean distance between points x_i and x_j in the lower dimensional representation

To perform MDS on the test problem Pareto fronts, I used the Manifold tool from the Yellowbrick package, a machine learning visualization module associated with sklearn. MDS representations of the four objective DTLZ 2 and DTLZ 7 are shown in Figure 2. For the highly disconnected DTLZ 7 problem, MDS clearly distinguishes the 8 clusters within the Pareto front.

Figure 2: MDS representations of the four objective DTLZ 2 (left) and DTLZ 7 (right)
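
If you want to experiment without Yellowbrick, a comparable embedding can be computed directly with scikit-learn, which Yellowbrick builds on. Here is a sketch, where the random array stands in for an actual N x 4 matrix of Pareto front objective values:

import numpy as np
from sklearn.manifold import MDS

objs = np.random.rand(500, 4)  # placeholder for a real 4-objective Pareto front
mds = MDS(n_components=2, random_state=1)
objs_2d = mds.fit_transform(objs)  # (N, 2) coordinates ready for a scatter plot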

2. IsoMaps

IsoMaps are an extension of MDS that first connect the points using K-nearest neighbors to build a neighborhood graph, then map the points to a lower dimensional space while preserving the geodesic (graph) distances between them. To create IsoMap visualizations for the test problems, I again utilized the Yellowbrick manifold function. IsoMap projections for four objective DTLZ 2 and DTLZ 7 are shown in Figure 3. Like MDS, IsoMapping is able to clearly demonstrate the disconnected nature of the DTLZ 7 Pareto front. I should mention that I had to use 30 neighbors to achieve this, which is much higher than the 8-12 neighbors recommended as an unofficial rule of thumb. This could be a result of the highly disconnected nature of DTLZ 7, which may cause problems for IsoMap.

Figure 3: IsoMap representations of the four objective DTLZ 2 (left) and DTLZ 7 (right)
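
The corresponding scikit-learn call for an IsoMap embedding is just as short (again a sketch; objs stands in for the 4-objective Pareto front):

import numpy as np
from sklearn.manifold import Isomap

objs = np.random.rand(500, 4)  # placeholder for a real 4-objective Pareto front
iso = Isomap(n_neighbors=30, n_components=2)  # 30 neighbors, as discussed above
objs_iso = iso.fit_transform(objs)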

3. Sammon Mapping

Like MDS and IsoMapping, Sammon Mapping (Sammon, 1969) seeks a lower dimensional representation that preserves the distances between the points of the Pareto front in the high dimensional space. Sammon Mapping uses a modified version of stress known as “Sammon Stress”, defined as:

S =\sum_{i} \sum_{j>i} \frac{(d^{*}_{ij} - d_{ij})^2}{d^{*}_{ij}}

Where:

d^{*}_{ij}: is the distance between points x_i and x_j in the full objective space

d_{ij}: is the distance between points x_i and x_j in the lower dimensional space

The key difference between Sammon Stress and the classic MDS stress is that Sammon Stress is normalized by the distance in the high dimensional space rather than the low dimensional representation. This function is usually minimized by gradient descent.
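
As a quick way to check how faithful a given embedding is, the Sammon stress written above can be computed directly from the two sets of coordinates. A sketch using scipy’s pairwise distances, where X_high and X_low are assumed to be the full- and low-dimensional coordinate arrays:

import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, X_low):
    # pdist returns the condensed upper-triangle distances, matching the j > i sum
    d_high = pdist(X_high)  # d*_ij: distances in the full objective space
    d_low = pdist(X_low)    # d_ij: distances in the low dimensional representation
    return np.sum((d_high - d_low)**2 / d_high)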

I coded the Sammon maps for the two test problems using an open source implementation from tompollard on Github. Like the other two methods, Sammon mapping highlights the disconnected nature of DTLZ 7 while showing a relatively continuous representation of DTLZ 2 that suggests its spherical shape.

Figure 4: Sammon mapping representations of the four objective DTLZ 2 (left) and DTLZ 7 (right)

4. Self Organizing Maps

Self Organizing Maps (SOM; Kohonen, 1982) use an artificial neural network to create a discrete, low dimensional representation of a high dimensional space. SOMs are a form of unsupervised machine learning that are used in both classification and dimensional reduction applications.

To map the high dimensional data, SOMs start with a uniformly spaced grid of neurons, then implement a competitive learning algorithm to associate each neuron with a set of Pareto front solutions. This video has the best explanation I’ve seen on how SOMs work (thanks to Rohini for sharing it with me). I also found this Gif from Wikicommons to be helpful in understanding SOMs.

Pareto front visualizations using SOMs plot the original uniform grid of neurons on an x-y plane, with the distance between the neurons of the final map shown as the color. Grid points with dark shading (long distances between final neurons) indicate boundaries between clusters in the high dimensional space. SOMs for the four objective DTLZ 2 and DTLZ 7 problems are shown in Figure 5. The disconnected clusters in DTLZ 7 are clearly apparent, while no real structure is shown for DTLZ 2.

Figure 5: SOM representations of the four objective DTLZ 2 (left) and DTLZ 7 (right)

Concluding thoughts

To be perfectly honest, I don’t think that any of the methods described in this post are particularly useful for understanding many-objective optimization results if used on their own. However, they may be a useful complement when exploring solution sets and finding patterns that may not be apparent when visualizing the full dimensional space. Additionally, they are all fairly straightforward to code and can easily be included in an exploratory analysis.

References

Deb, K., Thiele, L., Laumanns, M., & Zitzler, E. (2005). Scalable test problems for evolutionary multiobjective optimization. In Evolutionary multiobjective optimization (pp. 105-145). Springer, London.

Filipič, B., & Tušar, T. (2018, July). A taxonomy of methods for visualizing pareto front approximations. In Proceedings of the Genetic and Evolutionary Computation Conference (pp. 649-656).

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological cybernetics, 43(1), 59-69.

Kruskal, J. B. (1978). Multidimensional scaling (No. 11). Sage.

Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on computers, 100(5), 401-409.

Spatial statistics (Part 1): Spatial Autocorrelation

Exploratory spatial data analysis and statistics help us interpret maps more efficiently by finding trends, enabling pattern mining in space and time, identifying spatial outliers, etc. This information beyond the maps can help us understand the characteristics of places, phenomena, and the relationships between them. Therefore, we can use it for prediction and decision-making.

Many tools and libraries are available for analyzing spatial data, such as ArcGIS, GDAL, etc., that utilize raster and vector geospatial data formats. In my next blog posts, I am going to introduce some types of spatial statistics using different libraries in R. In this blog post, I explore the autocorrelation of the county-level 10-year average (2006-2016) of total potential evapotranspiration (ET) from March through the end of August. These datasets are aggregated from 4 km × 4 km grid cells for each county across the US and can be downloaded from here. Applying spatial autocorrelation explores whether potential ET values in counties near each other are more similar than in counties farther apart; in other words, it measures how distance affects our variable of interest. The existence of autocorrelation in data might lead to incorrect statistical inferences for some spatial statistical analyses, so it is important to test for spatial autocorrelation.

First, we are going to inspect the distribution of the data. To read the spatial data (shapefile), we need to use the “rgdal” library. Set your working directory to the path where you downloaded the data.

install.packages("rgdal")
library(rgdal)
setwd("---your path---")
ET<- readOGR(".","US_ETp_County")

We can see the names of the columns with names(ET), and the coordinate system of the data with crs(ET). To plot the spatial objects, I am using "spplot" from the "sp" library, which is the specialized plot method for spatial data with attributes. The first argument is an object of a Spatial class, which has coordinates, and the second argument, "zcol", is the attribute name or column number in the attribute table associated with the data.

library("sp")
spplot(ET, zcol='MEAN_ETp_n',col.regions = topo.colors(20), main="Average Potential ET during March-August (mm)") 

We can also use quantiles in order to better distinguish the distribution of data and potentially dilute the effect of outliers. To do this we need another library:

library("classInt")
breaks_quantile <- classIntervals(ET$MEAN_ETp_n, n = 10, style = "quantile")
breaks <- breaks_quantile$brks

ET$MEAN_ETp_qun<- cut(ET$MEAN_ETp_n, breaks)
spplot(ET, "MEAN_ETp_qun", col.regions = topo.colors(20), main = "Average Potential ET during March-August (mm)_Quantile")

As shown in both maps, the counties in the southern half of the US have higher potential ET during March-August. To explore the spatial autocorrelation we need to define the neighbors of counties.

install.packages("spdep ")
library(“spdep”)
neighbor<- poly2nb(ET,queen = FALSE)
#With the "queen" option we define whether a single shared boundary point meets the contiguity condition (queen = TRUE) or whether more than one shared point is required (queen = FALSE).
plot(ET, border = 'lightgrey')
plot(neighbor, coordinates(ET), add=TRUE, col='red')

Now, we are going to compute the local spatial autocorrelation, which explores spatial clustering across the US. First, we need to convert the neighbor data to a listw object. The "nb2listw" function supplements a neighbors list with spatial weights for the chosen coding scheme. Here, I chose "W", which is row standardized (each county's weights sum to one, so the weights over all links sum to n).

lw <- nb2listw(neighbor, style = "W")

To get the autocorrelations, a Moran's I test is used. The local Moran's index estimates a statistic for each county based on the spatial weights object used, and it is a statistical way to identify potential local clusters and spatial outliers. A positive index indicates that a feature has neighboring features with similarly high or low attribute values and can be part of a cluster; a negative value indicates a spatial outlier.
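
For reference, the local Moran statistic for county i follows Anselin (1995) and has the form

I_i = \frac{x_i - \bar{x}}{S_i^2} \sum_j w_{ij} (x_j - \bar{x})

where x_i is the average potential ET of county i, \bar{x} is the mean over all counties, w_{ij} are the spatial weights from the listw object, and S_i^2 is a variance term computed from the remaining counties (the exact scaling of this term depends on the implementation).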

local_moran <- localmoran(x = ET$MEAN_ETp_n, listw = lw)

From this, we get several statistical measures: Ii, the local Moran statistic; E.Ii, the expectation of the local Moran statistic; Var.Ii, its variance; Z.Ii, its standard deviate; and Pr(), the p-value of the local Moran statistic. Now we can add this result to our shapefile (ET) and plot the local Moran statistic. I am going to add the US_States boundary to the map.

moran_local_map <- cbind(ET, local_moran)
States<- readOGR(".","US_States")
spplot(moran_local_map, zcol='Ii',col.regions = topo.colors(20),main="Local Moran's I statistic for Potential ET (March-August)", sp.layout =list("sp.polygons",States,lwd=3, first = FALSE))

Again, we can present this data with quantiles:

breaks_quantile_new <- classIntervals(moran_local_map$Ii, n = 10, style = "quantile")
breaks_new <- breaks_quantile_new$brks
moran_local_map$Ii_qun <- cut(moran_local_map$Ii, breaks_new)
spplot(moran_local_map, zcol='Ii_qun',col.regions = topo.colors(20),main="Local Moran's I statistic for Potential ET (March-August)_Quantile", sp.layout =list("sp.polygons",States,lwd=3, first = FALSE))

Since the indices include negative and zero values, and there is large variation among the positive values, I modified the breaks manually in order to see the variation and the potential clustering at higher Moran's I values:

breaks_fix<- classIntervals(moran_local_map$Ii, n=17, style="fixed",fixedBreaks=c(-0.2, -0.0001, 0.0001, 0.2, 0.4,0.6,0.8,1,4,5,6,7,8,9,10,11,12,13))

Note that the p-value should be small enough (statistically significant) for the feature to be considered as part of a cluster.

References

Anselin, L. (1995). Local indicators of spatial association. Geographical Analysis, 27, 93–115.

Getis, A., & Ord, J. K. (1996). Local spatial statistics: an overview. In P. Longley & M. Batty (Eds.), Spatial analysis: modelling in a GIS environment (pp. 261–277). Cambridge: GeoInformation International.

Sokal, R. R., Oden, N. L., & Thomson, B. A. (1998). Local spatial autocorrelation in a biological model. Geographical Analysis, 30, 331–354.

Bivand, R. S., & Wong, D. W. S. (2018). Comparing implementations of global and local indicators of spatial association. TEST, 27(3), 716–748. https://doi.org/10.1007/s11749-018-0599-x