Alluvial Plots

We all love parallel coordinates plots and use them all the time to display our high dimensional data and tell our audience a good story. But sometimes we may have large amounts of data points whose tradeoffs’ existence or lack thereof cannot be clearly verified, or the data to be plotted is categorical and therefore awkwardly displayed in a parallel coordinates plot.

One possible solution to both issues is the use of alluvial plots. Alluvial plots work similarly to parallel coordinates plots, but instead of having ranges of values in the axes, it contains bins whose sizes in an axis depends on how many data points belong to that bin. Data points that fall within the same categories in all axes are grouped into alluvia (stripes), whose thicknesses reflect the number of data points in each alluvium.

Next are two examples of alluvial plots, the fist displaying categorical data and the second displaying continuous data that would normally be plotted in a parallel coordinates plot. After the examples, there is code available to generate alluvial plots in R (I know, I don’t like using R, but creating alluvial plots in R is easier than you think).

Categorical data

The first example (Figure 1) comes from the cran page for the alluvial plots package page. It uses alluvial plots to display data about all Titanic’s passengers/crew and group them into categories according to class, sex, age, and survival status.

alluvial_cran

Figure 1 – Titanic passenger/crew data. Yellow alluvia correspond to survivors and gray correspond to deceased. The size of each bin represents how many data points (people) belong to that category in a given axis, while the thickness of each alluvium represent how many people fall within the same categories in all axes. Source: https://cran.r-project.org/web/packages/alluvial/vignettes/alluvial.html.

Figure 1 shows that most of the passengers were male and adults, that the crew represented a substantial amount of the total amount of people in the Titanic, and that, unfortunately, there were more deceased than survivors. We can also see that a substantial amount of the people in the boat were male adult crew members who did not survive, which can be inferred by the thickness of the grey alluvium that goes through all these categories — it can also be seen by the lack of an alluvia hitting the Crew and Child bins, that (obviously) there were no children crew members. It can be also seen that 1st class female passengers was the group with the greatest survival rate (100%, according to the plot), while 3rd class males had the lowest (ballpark 15%, comparing the yellow and gray alluvia for 3rd class males).

Continuous data

The following example shows the results of policy modeling for a fictitious water utility using three different policy formulations. Each data point represents the modeled performance of a given candidate policy in six objectives, one in each axis. Given the uncertainties associated with the models used to generate this data, the client utility company is more concerned about whether or not a candidate policy would meet certain performance criteria according to the model (Reliability > 99%, Restriction Frequency < 20%, and Financial Risk < 10%) than about the actual objective values. The utility also wants to have a general idea of the tradeoffs between objectives.

Figure 2 was created to present the modeling results to the client water utility. The colored alluvia represent candidate policies that meet the utility’s criteria, and grey lines represent otherwise. The continuous raw data used to generate this plot was categorized following ranges whose values are meaningful to the client utility, with the best performing bin always put in the bottom of the plot. It is important to notice that the height of the bins represent the number of policies that belong to that bin, meaning that the position of the gap between two stacked bins does not represent a value in an axis, but the fraction of the policies that belong to each bin. It can be noticed from Figure 2 that it is relatively difficult for any of the formulations to meet the Reliability > 99% criteria established by the utility. It is also striking that a remarkably small number of policies from the first two formulations and none of the policies from the third formulation meet the criteria established by the utilities. It can also be easily seen by following the right alluvia that the vast majority of the solutions with smaller net present costs of infrastructure investment obtained with all three formulations perform poorly in the reliability and restriction frequency objectives, which denotes a strong tradeoff. The fact that such tradeoffs could be seen when the former axis is on the opposite side of the plot to the latter two is a remarkable feature of alluvial plots.

tradeoffs_all3.svg.png

Figure 2 – Alluvial plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The parallel coordinates plots in Figure 3 displays the same information as the alluvial plot in Figure 2. It can be readily seen that the analysis performed above, especially when it comes to the tradeoffs, would be more easily done with Figure 2 than with Figure 3. However, if the actual objective values were important for the analysis, Figure 3 would be needed either by itself or in addition to Figure 2, the latter being used likely as a pre-screening or for a higher level analysis of the results.

tradeoffs_all-svg

Figure 3 – Parallel coordinates plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The R code used to create Figure 1 can be found here. The code below was used to create Figure 2 — The packages “alluvia”l and “dplyr” need to be installed before attempting to use the provided code, for example using the R command install.packages(package_name). Also, the user needs to convert its continuous data into categorical data, so that each row corresponds to a possible combination of bins in all axis (one column per axis) plus a column (freqs) representing the frequencies with which each combination of bins is seen in the data.

# Example datafile: snippet of file "infra_tradeoffs_strong_freqs.csv"
Reliability, Net Present Cost of Inf. Investment, Peak Financial Costs, Financial Risk, Restriction Frequency, Jordan Lake Allocation, freqs
2<99,0<60,0<25,0<10,2>20,0<70,229
0>99,2>60,0<25,0<10,2>20,0<70,0
2<99,2>60,0<25,0<10,2>20,0<70,168
0>99,0<60,2>25,0<10,2>20,0<70,0
2<99,0<60,2>25,0<10,2>20,0<70,3
0>99,2>60,2>25,0<10,2>20,0<70,2
2<99,2>60,2>25,0<10,2>20,0<70,45
0>99,0<60,0<25,2>10,2>20,0<70,0
2<99,0<60,0<25,2>10,2>20,0<70,317
0>99,2>60,0<25,2>10,2>20,0<70,0
2<99,2>60,0<25,2>10,2>20,0<70,114
# load packages and prepare data
library(alluvial)
library(dplyr)

itss <- read.csv('infra_tradeoffs_strong_freqs.csv')
itsw <- read.csv('infra_tradeoffs_weak_freqs.csv')
itsn <- read.csv('infra_tradeoffs_no_freqs.csv')

# preprocess the data (convert do dataframe)
itss %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_strong
itsw %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_weak
itsn %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_no

# setup output file
svg(filename="tradeoffs_3_formulations.svg",
width=8,
height=8,
pointsize=18)
p <- par(mfrow=c(3,1))
par(bg = 'white')

# create the plots
alluvial(
its_strong[,1:6],
freq=its_strong$n,
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_strong$Financial.Risk == "0<10", "blue", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_strong$Financial.Risk == "0<10", "blue", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_strong$n < 1
)
alluvial(
its_weak[,1:6],
freq=its_weak$n,
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_weak$n < 1
)
alluvial(
its_no[,1:6],
freq=its_no$n,
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_no$Financial.Risk == "0<10", "red", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_no$Financial.Risk == "0<10", "red", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_no$n < 1
)
dev.off()
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s