Basics of data visualization with ggplot2

Basics of data visualization with ggplot2

In my previous post, I showed how wonderful the ggplot2 library in R is for visualizing complex networks. I realized that while there are several posts on this blog going over the advanced visualization capabilities of the ggplot2 library, there isn’t a primer on structuring code for creating graphs in R…yet. In this post, I will go over the syntax for creating pretty ggplot2 graphs and tweaking various parameters. I am a self-declared Python aficionado, but love using ggplot2 because it is intuitive to use, beginner-friendly, and highly customizable all at the same time.

Dataset and setup

For this tutorial, I will be using one of the built-in datasets in R called mtcars which was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Further documentation on this dataset can be found here. We import the data into our RStudio workspace.

# import the library into our workspace

# import dataset

The resultant dataset looks something like this.

Basic plot

Now that we have the data, we can get to plotting with ggplot2. We can declaratively create graphics using this library. We just have to provide the data, specify how to map properties to graph aesthetics, and the library takes care of the rest for us! We need to specify three things for each ggplot — 1) the data, 2) the aesthetics, and 3) the geometry.

Let us start by creating a basic scatterplot of the mileage (mpg) of each car as a function of its horsepower (hp). In this case the data is our dataframe mtcars, and the aesthetics x and y will be defined as the names of the columns we wish to plot along each axis — hp and mpg. We can also set the color aesthetic to indicate the number of cylinders (cyl) in each car. One of the reasons ggplot2 is so user-friendly is because each graph property can be tacked on to the same line of code with a + sign. Since we want a scatterplot, the geometry will be defined using geom_point().

# basic scatterplot
g <- ggplot(data = mtcars, aes(x = hp, y = mpg, color=cyl))
g + geom_point()

Excellent! The library automatically assigns the column names as axis labels, and uses the default theme and colors, but all of this can be modified to suit our tastes and to create pretty graphs. It is also important to note that we could have visualized the same data (less helpfully) as a line plot instead of a scatterplot, just by tweaking the geometry function.

# basic line plot
g + geom_line()

Well, this looks unpleasant. But wait, we can do so much more. We can also layer multiple geometries on the same graph to make more interesting plots.

# basic scatter+line plot
g + geom_line() + geom_point()

Additionally, we can tweak the geometry properties in each graph. Here is how we can transform the lines to dotted, and specify line widths and marker shape and size.

# change properties of geometry
g + geom_point(shape = "diamond", size = 3) +
  geom_line(color = "black", linetype = "dotted", size = .3) 

While our graph looks much neater now, using a line plot is actually pretty unhelpful for our dataset since each data point is a separate car. We will stick with a scatterplot for the rest of this tutorial. However, the above sort of graph would work great for time series data or other data that measures change in one variable.

Axis labels

One of the cardinal rules of good data visualization is to add axis labels to your graphs. While R automatically set the axis labels to be column headers, we can override this to make the axis labels more informative with just one extra function.

# change axis titles
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)")


This graph is in serious need of a title to provide a reader some idea of what they’re looking at. There are actually multiple ways to add a graph title here, but I find it easiest to use ggtitle().

# add title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") 

Alright, having a title is helpful, but I don’t love it’s placement on the graph. R automatically left-aligns the title, where it clashes with the y-axis. I would much rather have the title right-aligned, in a bigger font, and bolded. Here is how to do that.

# change position of title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))


There are ways to manually change the background and gridlines of ggplot2 graphs using theme(), but an easy workaround is to use the built-in themes. Which theme you use depends greatly on the graph type and formatting guidelines, but I personally like a white background, faint gridlines, and a bounding box. One thing to note here though is that theme_bw() overrides theme() so the order of these two matters.

# add theme
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also use the theme() function to change the base font size and font family. Shown below is how to increase the base font size to 15 and change the base font family to Courier.

# use theme to change base font family and font size
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme_bw(base_size = 15, base_family = "Courier") +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))


It has been bothering me for the last seven paragraphs that my legend title still uses the column name. However, this is an easy fix. All I have to do is add a label to the color aesthetic in the labs() function.

# change legend title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also change the position of the legend. R automatically places legends on the right, and while I like having it to the right in this case, I could also place the legend at the bottom of the graph. This automatically changes the aspect ratio of the graph.

# change legend position
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "bottom")


The theme() function is of endless use in ggplot2, and can be used to manually adjust the graph margins and add/remove white space padding. The order of arguments in margin() is counterclockwise — top, right, bottom, left (helpfully remembered by the pneumonic TRouBLe).

# add plot margins
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "right") +
  theme(plot.margin = margin(t = 1, r = 1, b = 1, l = 2, unit = "cm"))


I have barely scratched the surface of what can be achieved using ggplot2 in this post. There are hundreds of excellent tutorials online that dive deeper into ggplot2, like this blog post by Cedric Scherer. I have yet to learn so much about this library and data visualization in general, but have hopefully made a solid case for using ggplot2 to create clean and aesthetically-pleasing data visualizations.

Custom Plotting Symbols in R

R has a lot of options for plotting symbols, but sometimes you might want a something not in the base set. This post will cover a way to get a custom plotting symbol in R. The code is available on github.

I will use data is for 24 nations that competed in the 2021 Euros (men’s) tournament, specifically the FIFA rankings of the men’s and women’s national teams (accessed 12 Aug 2021). Using a flag to represent the country seems reasonable, and easier than text or plotting symbol and color combinations. Plus, it might add a bit of visual interest.

Although this might seem specialized, it highlights changing backgrounds in R base graphics, controlling aspect ratios, and adding mathematical notation to plots.

Some prep work before coding comes in handy. First, I downloaded PNG files of each country’s flag and saved it with the country’s name. Next, I checked Wikipedia for the aspect ratio of each flag and added it to my CSV file of the FIFA rankings.

The code begins by loading the library we will be using for a custom PCH and then reading in my data from the CSV. The men’s rankings range from 1 to 62 and the women’s rankings from 2 to 131.

library(png) # library for a custom pch

df <- read.csv('/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/fifa_national_team_rankings.csv',
               header=T) # read-in data from csv

# calculating ranges for men's and women's teams
men_range <- range(df$Men.s.National.Team)
women_range <- range(df$Women.s.National.Team)

Next, we set the basic parameters of the plot. Using the PNG function allows me to save it directly to a file and have a stable image process I an fine tune. The height and width needed to me modified until the axes were close to the same scale (10 ranks on x = 10 ranks on y). This will make life easier when trying to scale the flags with the aspect ratio.

# setting plot to save as a png file
# checked the output file to make sure
# axes were on the same scale, or close to it

par(mar=c(5,5,5,5)) # plot margins

Now, we can initialize the plot. Because we are not plotting the data yet, we can simply label the axes. Default axis ticks would include 0, so I used more natural axis ticks. The gray background makes flags with white more visible.

# setting up the base scatter plot
# used \ to get the "'" in titles
     ,ylab=expression('Men\'s Ranking')
     ,xlab=expression('Women\'s Ranking')
     ,main='Euros 2021 FIFA Nations\nMen\'s vs Women\'s National Teams'
     ,xlim=rev(women_range) # reverse axes to rank 1 is top right
     ,ylim=rev(men_range) # reverse axes to rank 1 is top right
     ,las=1 # rotate vertical axis labels to be horizontal
     ,xaxt='n' # no default x axis
     ,yaxt='n' # no default y axis

# adding axes
axis(side=1, at=c(1,seq(10,women_range[2],10)),las=1)
axis(side=2, at=c(1,seq(10,men_range[2],12)),las=1)

# adding a gray background to the plot
# many flgs have a white portion that would disappear otherwise

Now comes the fun part. Not all flags are shaped the same. The two extremes in these countries are Switzerland (aspect ratio = 1) and Croatia, Hungary, and North Macedonia (aspect ratio = 2). We will be telling the function the corners of where to plot the flag, so using the same x and y offset for all flags will lead to distortions.

To solve this problem, I set a value for the area of all flags. This will ensure that each country has the same visual impact. After some math, that means the x and y dimensions for each flag can be found based on the aspect ratio. Once we have all these, we can loop through the countries, calculate the coordinates for the flag corners, and use the country’s flag as the plotting symbol.

# setting approximate "area" for each country flag
flag_area <- 25

# looping over all the rankings
for(i in 1:nrow(df)){
  # read in the image and set it to a variable
  img <- readPNG(paste('/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/flags/',df$Country[i],'.png',sep=''))
  # math for same flag areas (approximately)
  # area = dx*dy = (dy^2)*(aspect_ratio)
  # dy = sqrt(area/aspect_ratio)
  # dx = sqrt(area*aspect_ratio)
  dy <- sqrt(flag_area/df$Aspect.Ratio[i])
  dx <- sqrt(flag_area*df$Aspect.Ratio[i])
  # plot the png image as a raster
  # need the image and coordinates
  # men's team on y axis, women's team on x axis

For a little bit of fun, and to illustrate one way to get Greek letters in a plot notation, I calculated the correlation of these ranks. The last line of this code closes the figure file.

# adding a text correlation
# using linear correlation of ranks, not rank correlation
correl <- round(cor(df$Men.s.National.Team,df$Women.s.National.Team),2)

# adding text notation for correlation
     labels = bquote(paste(," Linear Corr. (",rho,") = ",.(correl),sep=""))

# close file

And we are done! We see that the flags do look to be about the same area and have approximately correct aspect ratios. (I won’t say low long it took me to get that part figured out…) I hope this provides a fun way to change up your plots and help an audience better understand the data. Although I used flags here because it seemed the most natural for this data, any PNG file could be used based on your data and needs.


Introduction to PyBorg – basic setup and running

PyBorg is a new secondary implementation of Borg, written entirely in Python using the Platypus optimization library. PyBorg was developed by Andrew Dircks based on the original implementation in C and it is intended primarily as a learning tool as it is less efficient than the original C version (which you can still use with Python but through the use of the plugin “wrapper” also found in the package). PyBorg can be found in the same repository where the original Borg can be downloaded, for which you can request access here:

This blogpost is intended to demonstrate this new implementation. To follow along, first you need to either clone or download the BitBucket repository after you gain access.

Setting up the required packages is easy. In your terminal, navigate to the Python directory in the repository and install all prerequisites using python install. This will install all requirements (i.e. the Platypus library, numpy, scipy and six) for you in your current environment.

You can test that everything works fine by running the optimization on the DTLZ2 test function, found in The script creates an instance of the problem (as it is already defined in the Platypus library), sets it up as a ploblem for Borg to optimize and runs the algorithm for 10,000 function evaluations:

    # define a DTLZ2 problem instance from the Platypus library
    nobjs = 3
    problem = DTLZ2(nobjs)

    # define and run the Borg algorithm for 10000 evaluations
    algorithm = BorgMOEA(problem, epsilons=0.1)

A handy 3D scatter plot is also generated to show the optimization results.

The repository also comes with two other scripts and
The first demonstrates how to use the Platypus hypervolume indicator at a specified runtime frequency to get learn about its progress as the algorithm goes through function evaluations:

The latter provides more advanced functionality that allows you define custom parameters for Borg. It also includes a function to generate runtime data from the run. Both scripts are useful to diagnose how your algorithm is performing on any given problem.

The rest of this post is a demo of how you can use PyBorg with your own Python model and all of the above. I’ll be using a model I’ve used before, which can be found here, and I’ll formulate it so it only uses the first three objectives for the purposes of demonstration.

The first thing you need to do to optimize your problem is to define it. This is done very simply in the exact same way you’d do it on Project Platypus, using the Problem class:

from fishery import fish_game
from platypus import Problem, Real
from pyborg import BorgMOEA

# define a problem
nVars = 6
nObjs = 3 

problem = Problem(nVars, nObjs) # first input is no of decision variables, second input is no of objectives
problem.types[:] = Real(0, 1) #defines the type and bounds of each decision variable
problem.function = fish_game #defines the model function

This assumes that all decision variables are of the same type and range, but you can also define them individually using, e.g., problem.types[0].

Then you define the problem for the algorithm and set the number of function evaluations:

algorithm = BorgMOEA(problem, epsilons=0.001) #epsilons for each objective # number of function evaluations

If you’d like to also produce a runtime file you can use the detailed_run function included in the demo (in the files referenced above), which wraps the algorithm and runs it in intervals so the progress can be monitored. You can combine it with runtime_hypervolume to also track your hypervolume indicator. To use it you need to define the total number of function evaluations, the frequency with which you’d like the progress to be monitored and the name of the output file. If you’d like to calculate the Hypervolume (you first need to import it from platypus) you also need to either provide a known reference set or define maximum and minimum values for your solutions.

maxevals = 10000
frequency = 100
output = ""
hv = Hypervolume(minimum=[-6000, 0, 0], maximum=[0, 1, 100])

nfe, hyp = detailed_run(algorithm, maxevals, frequency, output, hv)

My full script can be found below. The detailed_run function is an edited version of the default that comes in the demo to also include the hypervolume calculation.

from fishery import fish_game
from platypus import Problem, Real, Hypervolume
from pyborg import BorgMOEA
from runtime_diagnostics import detailed_run

# define a problem
nVars = 6 # no. of decision variables to be optimized
nObjs = 3

problem = Problem(nVars, nObjs) # first input is no of decision variables, second input is no of objectives
problem.types[:] = Real(0, 1)
problem.function = fish_game

# define and run the Borg algorithm for 10000 evaluations
algorithm = BorgMOEA(problem, epsilons=0.001)

# define detailed_run parameters
maxevals = 10000
frequency = 100
output = ""
hv = Hypervolume(minimum=[-6000, 0, 0], maximum=[0, 1, 100])

nfe, hyp = detailed_run(algorithm, maxevals, frequency, output, hv)

# plot the results using matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter([s.objectives[0] for s in algorithm.result],
           [s.objectives[1] for s in algorithm.result],
           [s.objectives[2] for s in algorithm.result])
ax.set_xlabel('Objective 1')
ax.set_ylabel('Objective 2')
ax.set_zlabel('Objective 3')
ax.scatter(-6000, 0, 0, marker="*", c='orange', s=50)

plt.plot(nfe, hyp)
plt.title('PyBorg Runtime Hypervolume Fish game')
plt.xlabel('Number of Function Evaluations')

It produces the following two figures:

MORDM Basics V: WaterPaths Tutorial

The MORDM framework poses four fundamental questions that each corresponds to a section within the framework shown in Figure 1:

  1. What actions can we take to address the problem?
  2. What worlds are we implementing those actions in?
  3. What are acceptable levels of performance for those actions in all worlds?
  4. What controls failures across those worlds?
Figure 1: The MORDM framework (Herman et. al., 2013).

In the previous blog post, we used state-aware ROF triggers to implement drought mitigation and supply management actions for one hydroclimatic and demand scenario. In the first MORDM blog post, we generated multiple deeply-uncertain synthetic realizations of hydroclimatic and demand scenarios. The next logical question would be: how do these actions fare across all worlds and across time?

Setting up the system

To explore this question, we will expand the scope of our test case to include Cary’s two neighboring utilities – Durham and Raleigh – within the Research Triangle. The three utilities aim to form a cooperative water supply and management agreement in which they would like to optimize the following objectives, as stated in Trindade et al (2019):

  1. Maximize supply reliability, REL
  2. Minimize the frequency of implementing water use restrictions, RT
  3. Minimize the net present value of infrastructure investment, NPV
  4. Minimize the financial cost of drought mitigation and debt repayment, FC
  5. Minimize the worst first-percentile cost of the FC

In this example, each objective is a function of short-term risks of failure triggers (stROF) and long term risks of failure triggers (ltROF). The stROFs trigger short-term action, typically on a weekly or monthly basis. The ltROFs trigger action on a multi-year, sometimes decadal, timescale. Recall from a previous post that ROF triggers are state-aware, probabilistic decision rules that dictate how a system responds to risk. Here, we optimize for a Pareto-approximate set of ROF triggers (or risk thresholds) that will result in a range of performance objective tradeoffs. An example of a stROF is the restriction ROF trigger we explored in the post prior to this one.

In addition, an example of an ltROF would be the infrastructure construction ltROF. When this ltROF threshold is crossed, an infrastructure project is ‘built’. Potential infrastructure projects are ordered in a development pathway (Zeff et al 2016), and the ltROF triggers the next infrastructure option in the sequence. The term ‘pathway’ is used as these infrastructure sequences are not fixed, but are state-dependent and can be shuffled to allow the optimization process to discover multiple potential pathways.

Over the next two blog posts, we will explore the interaction between the water-use restriction stROF and the infrastructure ltROF, and how this affects system performance. For now, we will simulate the Research Triangle test case and optimize for the ‘best’ set of ROF triggers using WaterPaths and Borg MOEA.

Using WaterPaths and Borg MOEA

We will be using the WaterPaths utility planning and management tool (Trindade et al, 2020) to simulate the performance of the Research Triangle test case. For clarification, the default simulation within WaterPaths is that of the Research Triangle. The folder that you will be downloading from GitHub already contains all the input, uncertainty and decision variable files required. This tool will be paired with the Borg MOEA (Hadka and Reed, 2013) to optimize the performance objectives in each simulation to obtain a set of Pareto-optimal long- and short-term ROF triggers that result in a non-dominated set of tradeoffs. Jazmin made a few posts that walks through compiling Borg and the installation of different parallel and series versions of Borg that might be helpful to try out before attempting this exercise.

Once you have Borg downloaded and set up, begin by downloading the GitHub repository into a file location on your machine of choice:

git clone

Once all the files are downloaded, compile WaterPaths:

make gcc

To optimize the WaterPaths simulation with Borg, first move the Borg files into the main WaterPaths/borg folder:

mv -r Borg WaterPaths/borg/

This line of code will make automatically make folder called borg within the WaterPaths folder, and copy all the external Borg files into it.

Next, cd into the /borg folder and run make in your terminal. This should a generate a file called libborgms.a. Make a folder called lib within the WaterPaths folder, and move this file into the WaterPaths/lib folder

cp libborgms.a ../lib/

Next, cd back into the main folder and use the Makefile to compile the WaterPaths executable with Borg:

make borg

Great! Now, notice that the /rof_tables_test_problem folder is empty. You will need to generate ROF tables within the WaterPaths environment. To do so, run the script provided in the GitHub repository into your terminal. The script provided should look like this:

#SBATCH -n 16 -N 2 -p normal
#SBATCH --job-name=gen_rof_tables
#SBATCH --output=output/gen_rof_tables.out
#SBATCH --error=error/gen_rof_tables.err
#SBATCH --time=04:00:00
#SBATCH --mail-type=all

       	-t 2344\
       	-r 1000\
       	-C 1\ 
	-m 0\
	-s sample_solutions.csv\
       	-O rof_tables_test_problem/\
       	-e 0\
       	-U TestFiles/rdm_utilities_test_problem_opt.csv\
       	-W TestFiles/rdm_water_sources_test_problem_opt.csv\
       	-P TestFiles/rdm_dmp_test_problem_opt.csv\
	-p false

Replace all the ‘YOUR…’ parameters with your system-specific details. Make two new folders: output/ and error/. Then run the script above by entering

sbatch ./

into your terminal. This should take approximately 50 minutes. Once the ROF tables have been generated, run the script provided. It should look like this:

#SBATCH -n 16 -N 3 -p normal
#SBATCH --job-name=sedento_valley_optimization
#SBATCH --output=output/sedento_valley_optimization.out
#SBATCH --error=error/sedento_valley_optimization.err
#SBATCH --time=04:00:00
#SBATCH --mail-type=all


mpirun ./waterpaths -T ${OMP_NUM_THREADS}\
  -t 2344 -r ${N_REALIZATIONS} -d ${DATA_DIR}\
  -C -1 -O rof_tables_test_problem/ -e 3\
  -U TestFiles/rdm_utilities_test_problem_opt.csv\
  -W TestFiles/rdm_water_sources_test_problem_opt.csv\
  -P TestFiles/rdm_dmp_test_problem_opt.csv\
  -b true -o 200 -n 1000

As usual, replace all the ‘YOUR…’ parameters with your system-specific details. Run this script by entering

sbatch ./

into the terminal. This script runs the Borg MOEA optimization for 1,000 function evaluations, and will output a .set file every 200 function evaluations. At the end of the run, you should have two files within your main folder:

  1. NC_output_MS_S3_N1000.set contains the Pareto-optimal set of decision variables and the performance objective values for the individual utilities and the overall region.
  2. NC_runtime_MS_S3_N1000.runtime contains information on the time it took for 1000 simulations of the optimization of the Research Triangle to complete.

The process should take approximately 1 hour and 40 minutes.


Congratulations, you are officially the Dr Strange of the Research Triangle! You have successfully downloaded WaterPaths and Borg MOEA, as well as run a simulation-optimization of the North Carolina Research Triangle test case across 1000 possible futures, in which you were Pareto-optimal in more than one. You obtained the .set files containing the Pareto-optimal decision variables and their respective performance objective values. Now that we have optimized the performance of the Research Triangle system, we are ready to examine the performance objectives and the Pareto-optimal ROF trigger values that result in this optimal set of tradeoffs.

In the next blog post, we will process the output of the .set files to visualize the objective space, decision variable space, and the tradeoff space. We will also conduct robustness analysis on the Pareto-optimal set to delve further into the question of “What are acceptable levels of performance for those actions in all worlds?”. Finally, we will explore the temporal interactions between the water use restrictions stROF and the infrastructure construction ltROF, and how supply and demand are both impacted by – and have an effect on – these decision variables.


Hadka, D., & Reed, P. (2013). Borg: An auto-adaptive Many-objective evolutionary computing framework. Evolutionary Computation, 21(2), 231–259.

Trindade, B. C., Gold, D. F., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2020). Water pathways: An open source stochastic simulation system for integrated water supply portfolio management and infrastructure investment planning. Environmental Modelling & Software, 132, 104772.

Trindade, B. C., Reed, P. M., & Characklis, G. W. (2019). Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management. Advances in Water Resources, 134, 103442.

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research, 52(9), 7327–7346.