Web Scraping: Constructing URLs, Downloading and Unpacking Zipped Files in Python and R

Introduction

Being able to automate data retrieval helps you avoid irritating, repetitive tasks. Oftentimes, these data files are separated by time range, site location, or even type of measurement, making them cumbersome to download manually. This blog post outlines how to download multiple zipped csv files from a webpage using both R and Python.

We will specifically explore downloading historical hourly locational marginal pricing (LMP) data files from PJM, a regional transmission organization coordinating a wholesale electricity market across 13 Midwestern, Southeastern, and Northeastern states. The files in question are located here.

First, Using Python

A distinct advantage of using Python over R here is that the same result takes fewer lines of code. However, the difference in execution time between the two languages is generally minimal.

Constructing URLs

The URL for the most recent file on the webpage, for August 2017, has the following format: “http://www.pjm.com/pub/account/lmpmonthly/201708-da.zip” (a zipped archive containing the csv file 201708-da.csv).
Notably, the URL is straightforward in its structure, which allows for a fairly simple approach to creating it from scratch.

To begin, we can deconstruct the URL into three aspects: the base URL, the date, and the file extension.
The base URL is fairly straightforward because it is constant: “http://www.pjm.com/pub/account/lmpmonthly/”
As is the file extension: “-da.zip”

The trickiest piece to recreate is the date: “201708”
It is in the form “yyyymm”, requiring a 4-digit year (luckily these records don’t go back to the first millennium) and a 2-digit, zero-padded month.

In Python, this is remarkably easy to reconstruct using the .format() string method. All that is required is a little of the format-specification mini-language within the curly braces. Note that the curly braces are inserted within the string in the following form: {index:modifier}

In this example, we follow the format above, deconstructing the URL and then reassembling it in a single line of Python. Assigning this string to a variable allows for easy iteration by simply changing the inputs. So let’s create a quick function named url_creator that takes a date in the format [yyyy, mm] and returns the constructed URL as a string:

def url_creator(date):
    """
    Input is list [yyyy, mm]
    """
    yyyy, mm = date

    return "http://www.pjm.com/pub/account/lmpmonthly/{0}{1:02d}-da.zip".format(yyyy, mm)  # zero-pad the month to two digits

To quickly generate the dates we’re interested in, creating a list with entries in the format [yyyy, mm] for each of the months is easy enough using a pair of nested for loops. If you need different dates, you can alter the loops or manually create a list of lists to accommodate your needs:

#for loop method (years 2001-2017, months 1-12)
dates = []
for i in range(17):
    for j in range(12):
        dates.append([2001 + i, j + 1])

#alternative manual method
dates = [[2011, 1], [2011, 2], [2011, 3]]

Retrieving and Unpacking Zipped Files

Now that we can create the URL for this specific file type by calling the url_creator function, and have all of the dates we may be interested in, we need to access, download, extract, and save these files. Utilizing the urllib library, we can request and download the zipped file. Note that urllib.request.urlretrieve only retrieves the file and does not attempt to read it. While we could simply save the zipped file at this step, it is preferable to extract it now to prevent headaches down the line.

Utilizing the zipfile library, we can then extract the downloaded archive to a specified folder. Notably, I use the os module’s getcwd function while extracting the file so that the resulting csv file is saved in the directory where this script is running. Following this, the zip file is closed.

import os
import zipfile
import urllib.request

for date in dates:
    baseurl = url_creator(date)

    # retrieve the zipped file to a temporary local file
    local_filename, headers = urllib.request.urlretrieve(url=baseurl)
    # extract its contents into the current working directory, then close the archive
    with zipfile.ZipFile(file=local_filename, mode='r') as zip_ref:
        zip_ref.extractall(path=os.getcwd())

At this point, the extracted csv files will be located in the directory where this script was run.

Alternatively, Using R

We will be using csv files to specify date ranges instead of the for loops shown in the Python example above. The advantage of using csv files rather than indexing i=2001:2016 directly is that if you have a list of non-consecutive months or years, it is often easier to make a list and cycle through all of its elements than to figure out how to index the dates directly. In this example, however, it would be just as easy to index through the years and months as through the csv files.
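For reference, that direct-indexing alternative takes only a couple of lines; here is a quick sketch (with ranges chosen to match the loops below, and hypothetical variable names):

#direct-indexing alternative: consecutive years and months as plain vectors
years_vec <- 2001:2016
months_vec <- 1:12

The actual script, however, uses the csv approach.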

This first block of code sets the working directory and reads in the two csv files. The first csv file contains a list of the years 2001-2016. The second csv file lists the months January-December as numbers.

#Set working directory
#this is the directory where we want to "dump" the data
setwd("E:/Data/WebScraping/PJM/Markets_and_operations/Energy/Real_Time_Monthly/R")

#Create csv files with a list of the relevant years and months
years=read.csv("Years.csv")
months=read.csv("Months.csv")

The breakdown of the URL was shown previously. To recreate it in R, we will use a for-loop that cycles the link through all the years and months listed in the csv files. The indices i and j iterate through the elements of the year and month csv files respectively. (The if statement will be discussed at the end of this section.) We then use the paste function to assemble the entire link into a string, assigned to downloadurl. The parts of the link surrounded by quotes are unchanging, while the as.character function converts the elements cycled in from the csv files into characters. Finally, sep="" denotes that there should be no space separating the pieces. The next two lines download the file into R as a temporary file.

#i indexes through the years and j indexes through the months
for (i in (1:16)){
  for (j in (1:12)){
    if (j>9){

      #download the url and save it as a temporary file
      downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),as.character(months[j,1]),"-da.zip",sep="")
      temp="tempfile"
      download.file(downloadurl,temp)

Next, we unzip the temporary file and find the csv inside (which has a name of the form 200101-da.csv), reading it into the R environment as “x”. We then assign a name of the form “200101” to “x”. Finally, it is written out as a csv file stored in the working directory, the temporary file is deleted, and the for-loop starts over again.

      #read the csv into the global environment
      x=read.csv(unz(temp,paste(as.character(years[i,1]),as.character(months[j,1]),"-da.csv",sep="")))
      #assign a name to the csv "x"
      newname=paste(as.character(years[i,1]),as.character(months[j,1]),sep="")
      assign(newname,x)
      #create a csv that is stored in your working directory
      write.csv(x,file=paste(newname,".csv",sep=""))

      unlink(temp) #delete temp file
    }

    #If j is less than 10, the only difference is that a "0" has to be added
    #in front of the month number (handled in the else block below)

Overcoming Formatting

The reason for the “if” statement is that it is not trivial to properly format the dates in Excel when writing the list of dates to a csv. When you input “01”, the entry is simply changed to “1” when the file is saved as a csv. However, in the names of the files, the first nine months have a 0 preceding their number, so the URL construction must be modified for these months. The downloadurl has been changed so that a 0 is added before the month number if j < 10. Aside from the hard-coded zeroes, this block of code is the same as the one above. (A sketch after the code block shows how sprintf can sidestep the leading-zero problem entirely.)

else {

      downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),"0",as.character(months[j,1]),"-da.zip",sep="")
      temp="tempfile"
      download.file(downloadurl,temp)

      x=read.csv(unz(temp,paste(as.character(years[i,1]),"0",as.character(months[j,1]),"-da.csv",sep="")))
      newname=paste(as.character(years[i,1]),"0",as.character(months[j,1]),sep="")
      assign(newname,x)
      write.csv(x,file=paste(newname,".csv",sep=""))

      unlink(temp) #delete temp file
    }
  }
}
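As referenced above, here is a hedged sketch (not part of the original script) of how sprintf can sidestep the leading-zero problem entirely, assuming the year and month columns are numeric:

#sprintf zero-pads the month, so one loop body replaces the if/else branching
yyyymm <- sprintf("%d%02d", years[i,1], months[j,1])
downloadurl <- paste("http://www.pjm.com/pub/account/lmpmonthly/", yyyymm, "-da.zip", sep="")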

Acknowledgements

This was written in collaboration with Rohini Gupta who contributed by defining the problem and explaining her approach in R.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1650441. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Versions: R 3.4.1, Python 3.6.2
To see the code in its entirety, please visit the linked GitHub Repository.


Generating Interactive Visuals in R

Visuals vs. Visual Analytics 

How do visuals differ from visual analytics? In a scientific sense, a visual is a broad term for any picture, illustration, or graph that can be used to convey an idea. However, visual analytics is more than just generating a graph of complex data and handing it to a decision maker. Visual analytic tools help create graphs that allow the user to interact with the data, whether that involves manipulating a graph in three-dimensional space or allowing users to filter or brush for solutions that match certain criteria. Ultimately, visual analytics seeks to help in making decisions as fast as possible and to “enable learning through continual [problem] reformulation” (Woodruff et al., 2013) by presenting large data sets in an organized way so that the user can better recognize patterns and make inferences.

My goal with this blog post is to introduce two R libraries that are particularly useful to develop interactive graphs that will allow for better exploration of a three-dimensional space. I have found that documentation on these libraries and potential errors was sparse, so this post will consolidate my hours of Stack Overflow searching into a step-by-step process to produce beautiful graphs!

R Libraries

1. Use rgl to create a GIF of a 3D graph

Spinning graphs can be especially useful for visualizing a 3D Pareto front, and they make a nice visualization for a PowerPoint presentation. I will be using an example three-objective Pareto set from Julie’s work on the Red River Basin for this tutorial. The script is broken down and explained in the following sections.


#Set working directory
setwd("C:/Users/Me/Folder/Blog_Post_1")

#Read in csv of pareto set
data=read.csv("pareto.csv")

#Create three vectors for the three objectives
hydropower=data$WcAvgHydro
deficit=data$WcAvgDeficit
flood=data$WcAvgFlood

In this first block of code, the working directory is set, the data set is imported from a CSV file, and each column of the data frame is saved as a vector that is conveniently named. Now we will generate the plot.


#call the rgl library
library(rgl)

#Adjust the size of the window
par3d(windowRect=c(0,0,500,500))

If the rgl package isn’t installed on your computer yet, simply type install.packages("rgl") into the console. Otherwise, use the library call shown above to load the rgl package. The next line of code adjusts the window that the graph will pop up in: the default window is very small, and the movie will have a low resolution if the window is not adjusted!


#Plot the set in 3D space
library(RColorBrewer) #provides the brewer.pal color palettes used below

plot3d(hydropower,deficit,flood,col=brewer.pal(8,"Blues"), size=2, type='s', alpha=0.75)

Let’s plot these data in 3D space. The first three arguments of the plot3d function are the x, y, and z vectors respectively. The rest of the parameters are subject to your personal preference. I used Color Brewer (install the package “RColorBrewer”) to color the data points in different blue gradients: the first value is the number of colors that you want, and the second value is the color set. Color Brewer sets can be found here: http://www.datavis.ca/sasmac/brewerpal.html. My choice of colors is arbitrary, so I opted not to create a color scale. Creating a color scale is more involved in rgl; one option is to split your data into classes and use legend3d and the cut function to break your legend into color levels, but there simply isn’t an easy way to create a continuous color scale in rgl. Finally, I wanted my data points to be spheres, of size 2, that were 25% transparent (alpha = 0.75), which is specified with type, size, and alpha respectively. plot3d will open a window with your graph; you can use your mouse to rotate it.
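For instance, here is a rough sketch of that binned color-scale workaround, assuming the hydropower, deficit, and flood vectors from above (the number of breaks is arbitrary):

library(RColorBrewer)
classes <- cut(flood, breaks=4)  #bin the color variable into 4 classes
pal <- brewer.pal(4,"Blues")     #one color per class
plot3d(hydropower, deficit, flood, col=pal[classes], size=2, type='s', alpha=0.75)
legend3d("topright", legend=levels(classes), pch=16, col=pal)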

Now, let’s make a movie of the graph. The movie3d function requires ImageMagick, software that creates a GIF by stitching together multiple pictures. ImageMagick also has cool functionality for editing, resizing, and layering pictures. It can be installed through R using the first two lines of code below; make sure not to re-run these lines once ImageMagick is installed. Note that ImageMagick doesn’t have to be installed in your directory, just on your computer.


require(installr)
install.ImageMagick()

#Create a spinning movie of your plot
movie3d(spin3d(axis = c(0, 0, 1)), duration = 20,
 dir = getwd())

Finally, the last line of code generates the movie. I have specified that the plot should spin about the z axis, given a duration (you can play around with the number to see what suits your data), and asked for the movie to be saved in my current working directory. The resulting GIF is below. If the GIF has stopped running, reload the page and scroll down to this section again.

movie.gif

I have found that creating the movie can be a bit finicky, and the last step is where errors usually occur. When you execute your code, make sure that you keep the plot window open while ImageMagick stitches together the snapshots; otherwise you will get an error. If you have errors, please feel free to share, because I most likely had them at one point and was able to ultimately fix them.

Overall, I found this package useful for a quick overview of the 3D space, but I wasn’t pleased with the way the axis values and titles sometimes overlap when the graph spins. The workaround is to set the labels and title to NULL and insert your own non-moving labels and title when you add the GIF to a PowerPoint presentation.

2. Use plotly to create an interactive scatter plot

I much prefer the plotly package to rgl for its aesthetics, the ease of creating a color scale, and the ability to mouse over points to obtain coordinate values in a scatter plot. Plotly is an open-source JavaScript graphing library with an R API. The first step is to create a Plotly account at https://plot.ly/. Once you have confirmed your email address, head to https://plot.ly/settings/api to get an API key: click the “regenerate key” button and you’ll get a 20-character key that will be used to create a shareable link to your chart. Perfect, now we’re ready to get started!

setwd("C:/Users/Me/Folder/Blog_Post_1")

library(plotly)
library(ggplot2)

#Set environment variables

Sys.setenv("plotly_username"="rg727")
Sys.setenv("plotly_api_key"="insert key here")

#Read in pareto set data

pareto=read.csv("ieee_synthetic_thinned.csv")

Set the working directory, load the relevant libraries, set the environment variables, and read in the data set. Be sure to insert your API key; you will need to regenerate a new key every time you make a new graph. Also, note that your data must be in the form of a data frame for plotly to work.

#Plot your data, piping the markers and axis titles into the plot object
plot=plot_ly(pareto, x = ~WcAvgHydro, y = ~WcAvgDeficit, z = ~WcAvgFlood, color = ~WcAvgFlood, colors = c('#00d6f7', '#ebfc2f')) %>%
  add_markers() %>%
  layout(title="Pareto Set", scene = list(xaxis = list(title = 'Hydropower'), yaxis = list(title = 'Deficit'), zaxis = list(title = 'Flood')))

#call the name of your plot to appear in the viewer
plot

To correctly use the plot_ly command, the first input needed is the data frame, followed by the column names of the x, y, and z columns in the data frame. Precede each column name with a “~”.

I decided that I wanted the colors to scale with the value of the z variable. The colors were defined using color codes available at http://www.color-hex.com/. Use the layout function to add a main title and axis labels. Finally, call the name of your plot and you will see it appear in the viewer at the lower right of your screen. If your viewer shows up blank with only the color scale, click on the viewer or click “zoom”. Depending on how large the data set is, it may take some time for the graph to load.


#Create a link to your chart and call it to launch the window
chart_link = api_create(plot, filename = "public-graph")
chart_link

Finally, create the chart link using the first line of code above, and the next line will launch the graph in Plotly. Copy and save the URL; anyone with it can access your graph, even if they don’t have a Plotly account. Play around with the cool capabilities of my Plotly graph, like mousing over points, rotating, and zooming!

https://plot.ly/~rg727/1/

 

Sources: 

http://www.sthda.com/english/wiki/a-complete-guide-to-3d-visualization-device-system-in-r-r-software-and-data-visualization

https://plot.ly/r/3d-scatter-plots/

Woodruff, M.J., Reed, P.M. & Simpson, T.W. Struct Multidisc Optim (2013) 48: 201. https://doi.org/10.1007/s00158-013-0891-z

James J. Thomas and Kristin A. Cook (Ed.) (2005). Illuminating the Path: The R&D Agenda for Visual Analytics National Visualization and Analytics Center.

Alluvial Plots


We all love parallel coordinates plots and use them all the time to display our high-dimensional data and tell our audience a good story. But sometimes we may have so many data points that the existence (or absence) of tradeoffs cannot be clearly verified, or the data to be plotted are categorical and therefore awkwardly displayed in a parallel coordinates plot.

One possible solution to both issues is the use of alluvial plots. Alluvial plots work similarly to parallel coordinates plots, but instead of having ranges of values on the axes, each axis contains bins whose sizes depend on how many data points belong to that bin. Data points that fall within the same categories in all axes are grouped into alluvia (stripes), whose thicknesses reflect the number of data points in each alluvium.

Next are two examples of alluvial plots, the first displaying categorical data and the second displaying continuous data that would normally be plotted in a parallel coordinates plot. After the examples, there is code available to generate alluvial plots in R (I know, I don’t like using R, but creating alluvial plots in R is easier than you think).

Categorical data

The first example (Figure 1) comes from the CRAN vignette for the alluvial package. It uses alluvial plots to display data about all of the Titanic’s passengers/crew, grouping them into categories according to class, sex, age, and survival status.

alluvial_cran

Figure 1 – Titanic passenger/crew data. Yellow alluvia correspond to survivors and gray correspond to deceased. The size of each bin represents how many data points (people) belong to that category in a given axis, while the thickness of each alluvium represent how many people fall within the same categories in all axes. Source: https://cran.r-project.org/web/packages/alluvial/vignettes/alluvial.html.

Figure 1 shows that most of the passengers were male adults, that the crew represented a substantial share of the people on the Titanic, and that, unfortunately, there were more deceased than survivors. We can also see that a substantial number of the people on the boat were male adult crew members who did not survive, which can be inferred from the thickness of the grey alluvium that runs through all of these categories. The absence of any alluvium connecting the Crew and Child bins shows that (obviously) there were no child crew members. It can also be seen that 1st class female passengers were the group with the greatest survival rate (100%, according to the plot), while 3rd class males had the lowest (roughly 15%, comparing the yellow and gray alluvia for 3rd class males).
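Figure 1 takes only a few lines to reproduce. Here is a minimal sketch in the spirit of the package vignette, using the Titanic dataset that ships with R (the colors are arbitrary):

library(alluvial)
tit <- as.data.frame(Titanic)  #columns: Class, Sex, Age, Survived, Freq
alluvial(tit[,1:4], freq=tit$Freq,
         col=ifelse(tit$Survived == "Yes", "orange", "grey"),
         border=NA)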

Continuous data

The following example shows the results of policy modeling for a fictitious water utility using three different policy formulations. Each data point represents the modeled performance of a given candidate policy in six objectives, one in each axis. Given the uncertainties associated with the models used to generate this data, the client utility company is more concerned about whether or not a candidate policy would meet certain performance criteria according to the model (Reliability > 99%, Restriction Frequency < 20%, and Financial Risk < 10%) than about the actual objective values. The utility also wants to have a general idea of the tradeoffs between objectives.

Figure 2 was created to present the modeling results to the client water utility. The colored alluvia represent candidate policies that meet the utility’s criteria, and the grey alluvia represent those that do not. The continuous raw data used to generate this plot were categorized into ranges whose values are meaningful to the client utility, with the best-performing bin always placed at the bottom of the plot. It is important to note that the height of a bin represents the number of policies that belong to it, meaning that the position of the gap between two stacked bins does not represent a value on an axis, but the fraction of the policies that belong to each bin. It can be seen from Figure 2 that it is relatively difficult for any of the formulations to meet the Reliability > 99% criterion established by the utility. It is also striking that a remarkably small number of policies from the first two formulations, and none of the policies from the third formulation, meet all of the criteria. It can also be easily seen, by following the alluvia to the right, that the vast majority of the solutions with smaller net present costs of infrastructure investment obtained with all three formulations perform poorly in the reliability and restriction frequency objectives, which denotes a strong tradeoff. The fact that such tradeoffs can be seen even when the former axis is on the opposite side of the plot from the latter two is a remarkable feature of alluvial plots.

tradeoffs_all3.svg.png

Figure 2 – Alluvial plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The parallel coordinates plots in Figure 3 display the same information as the alluvial plot in Figure 2. It can be readily seen that the analysis performed above, especially when it comes to the tradeoffs, is more easily done with Figure 2 than with Figure 3. However, if the actual objective values were important for the analysis, Figure 3 would be needed, either by itself or in addition to Figure 2, with the latter likely used for pre-screening or for a higher-level analysis of the results.

tradeoffs_all-svg

Figure 3 – Parallel coordinates plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The R code used to create Figure 1 can be found here. The code below was used to create Figure 2. The packages “alluvial” and “dplyr” need to be installed before attempting to use the provided code, for example using the R command install.packages(package_name). Also, you need to convert your continuous data into categorical data, so that each row corresponds to a possible combination of bins in all axes (one column per axis) plus a column (freqs) representing the frequency with which that combination of bins is seen in the data.
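That conversion is mostly a cut() and count() exercise. Here is a hedged sketch, assuming a data frame raw of continuous objective values and a recent version of dplyr (the column names and break points are placeholders):

library(dplyr)
binned <- raw %>%
  mutate(Reliability = cut(Reliability, breaks=c(0, 99, 100)),
         Financial.Risk = cut(Financial.Risk, breaks=c(0, 10, 100))) %>%  #...and so on for the other axes
  count(Reliability, Financial.Risk, name="freqs")  #one row per combination of bins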

# Example datafile: snippet of file "infra_tradeoffs_strong_freqs.csv"
Reliability, Net Present Cost of Inf. Investment, Peak Financial Costs, Financial Risk, Restriction Frequency, Jordan Lake Allocation, freqs
2<99,0<60,0<25,0<10,2>20,0<70,229
0>99,2>60,0<25,0<10,2>20,0<70,0
2<99,2>60,0<25,0<10,2>20,0<70,168
0>99,0<60,2>25,0<10,2>20,0<70,0
2<99,0<60,2>25,0<10,2>20,0<70,3
0>99,2>60,2>25,0<10,2>20,0<70,2
2<99,2>60,2>25,0<10,2>20,0<70,45
0>99,0<60,0<25,2>10,2>20,0<70,0
2<99,0<60,0<25,2>10,2>20,0<70,317
0>99,2>60,0<25,2>10,2>20,0<70,0
2<99,2>60,0<25,2>10,2>20,0<70,114
# load packages and prepare data
library(alluvial)
library(dplyr)

itss <- read.csv('infra_tradeoffs_strong_freqs.csv')
itsw <- read.csv('infra_tradeoffs_weak_freqs.csv')
itsn <- read.csv('infra_tradeoffs_no_freqs.csv')

# preprocess the data (sum the frequencies within each combination of bins)
itss %>%
  group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs,
           Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
  summarise(n = sum(freqs)) -> its_strong
itsw %>%
  group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs,
           Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
  summarise(n = sum(freqs)) -> its_weak
itsn %>%
  group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs,
           Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
  summarise(n = sum(freqs)) -> its_no

# setup output file
svg(filename = "tradeoffs_3_formulations.svg",
    width = 8,
    height = 8,
    pointsize = 18)
p <- par(mfrow = c(3, 1))
par(bg = 'white')

# create the plots -- color an alluvium only if it meets all three performance criteria
alluvial(
  its_strong[, 1:6],
  freq = its_strong$n,
  col = ifelse(its_strong$Reliability == "0>99" &
                 its_strong$Restriction.Frequency == "0<20" &
                 its_strong$Financial.Risk == "0<10", "blue", "grey"),
  border = ifelse(its_strong$Reliability == "0>99" &
                    its_strong$Restriction.Frequency == "0<20" &
                    its_strong$Financial.Risk == "0<10", "blue", "grey"),
  alpha = 0.5,
  hide = its_strong$n < 1
)
alluvial(
  its_weak[, 1:6],
  freq = its_weak$n,
  col = ifelse(its_weak$Reliability == "0>99" &
                 its_weak$Restriction.Frequency == "0<20" &
                 its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
  border = ifelse(its_weak$Reliability == "0>99" &
                    its_weak$Restriction.Frequency == "0<20" &
                    its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
  alpha = 0.5,
  hide = its_weak$n < 1
)
alluvial(
  its_no[, 1:6],
  freq = its_no$n,
  col = ifelse(its_no$Reliability == "0>99" &
                 its_no$Restriction.Frequency == "0<20" &
                 its_no$Financial.Risk == "0<10", "red", "grey"),
  border = ifelse(its_no$Reliability == "0>99" &
                    its_no$Restriction.Frequency == "0<20" &
                    its_no$Financial.Risk == "0<10", "red", "grey"),
  alpha = 0.5,
  hide = its_no$n < 1
)
dev.off()

Root finding in MATLAB, R, Python and C++

In dynamical systems, we are often interested in finding stable points, or equilibria. Some systems have multiple equilibria. As an example, take the lake problem, which is modeled by the equation below, where X_t is the lake P concentration, a_t are the anthropogenic P inputs, Y_t ~ LN(μ,σ²) are random natural P inputs, b is the P loss rate, and q is a shape parameter controlling the rate of P recycling from the sediment. The first three terms on the right-hand side make up the “Inputs” in the figure, while the last term represents the “Outputs.” A lake is in equilibrium when the inputs are equal to the outputs and the lake P concentration therefore is not changing over time.

X_{t+1} = X_t + a_t + Y_t + \frac{X^q_t}{1+X^q_t} - bX_t

For irreversible lakes this occurs at three locations, even in the absence of anthropogenic and natural inputs: an oligotrophic equilibrium, an unstable equilibrium (called the critical P threshold) and a eutrophic equilibrium (see figure below).

PcritDiagram

The unstable equilibrium in this case is called the critical P threshold because once it is crossed, it is impossible to return to an oligotrophic equilibrium by reducing anthropogenic and natural P inputs alone. In irreversible lakes like this, we would therefore like to keep the lake P concentration below the critical P threshold. How do we find the critical P threshold? With a root finding algorithm!

As stated earlier, the system above will be in equilibrium when the inputs are equal to the outputs and the P concentration is not changing over time, i.e. when

X_{t+1} - X_t = \frac{X^q_t}{1+X^q_t} - bX_t = 0

Therefore we simply need to find the zero, or “root,” of the above equation. Most of the methods for this require either an initial estimate or upper and lower bounds on the location of the root. These are important, since an irreversible lake will have three roots. If we are only interested in the critical P threshold, we have to make sure that we provide an estimate that leads to the unstable equilibrium, not to either of the stable equilibria. If possible, you should plot the function whose root you are finding to make sure you are giving a good initial estimate or bounds, and check afterward to ensure the root that was found is the one you want! Here are several examples of root-finding methods in different programming languages.

In MATLAB, roots can be found with the function fzero(fun,x0) where ‘fun’ is the function whose root you want to find, and x0 is an initial estimate. This function uses Brent’s method, which combines several root-finding methods: bisection, secant, and inverse quadratic interpolation. Below is an example using the lake problem.

myfun = @(x,b,q) x^q/(1+x^q)-b*x;
b = 0.42;
q = 2.0;
fun = @(x) myfun(x,b,q);
pcrit = fzero(fun,0.75);

This returns pcrit = 0.5445, which is correct. If we had provided an initial estimate of 0.25 instead of 0.75, we would get pcrit = 2.6617E-19, basically 0, which is the oligotrophic equilibrium in the absence of anthropogenic and natural P inputs. If we had used 1.5 as an initial estimate, we would get pcrit = 1.8364, the eutrophic equilibrium.

MatlabScreenShot

In R, roots can be found with the function uniroot, which also uses Brent’s method. Dave uses this on line 10 of the function lake.eval in his OpenMORDM example. Instead of taking in an initial estimate of the root, this function takes in a lower and upper bound. This is safer, as you at least know that the root estimate will lie within these bounds. Providing an initial estimate that is close to the true value should do well, but is less predictable; the root finding algorithm may head in the opposite direction from what is desired.

b <- 0.42
q <- 2.0
pcrit <- uniroot(function(x) x^q/(1+x^q) - b*x, c(0.01, 1.5))$root

This returns pcrit = 0.5445145. Good, we got the same answer as we did with MATLAB! If we had used bounds of c(0.75, 2.0) we would have gotten 1.836426, the eutrophic equilibrium.

What if we had given bounds that included both of these equilibria, say c(0.5, 2.0)? In that case, R returns an error: ‘f() values at end points not of opposite sign’. That is, if the value returned by f(x) is greater than 0 for the lower bound, it must be less than 0 for the upper bound, and vice versa. In this case both f(0.5) and f(2.0) are less than 0, so the algorithm fails. What if we gave bounds for which f is positive at one end and negative at the other, but within which there are multiple roots, say c(-0.5, 2.0)? Then R just reports the first root it finds, in this case pcrit = 1.836437, the eutrophic equilibrium. So it’s important to make sure you pick bounds narrow enough to include the root you want, but not the roots you don’t!

RscreenShot
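To pick bounds with confidence, it helps to follow the earlier advice and plot the function first. A quick base-R sketch (same b and q as above):

b <- 0.42
q <- 2.0
curve(x^q/(1 + x^q) - b*x, from=0, to=2.5, ylab="f(x)")
abline(h=0, lty=2)  #the three zero-crossings are the three equilibria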

In Python, you can use either scipy.optimize.root or scipy.optimize.brentq, which is what Jon uses on line 14 here. scipy.optimize.root can be used with several different algorithms; the default is a modification of Powell’s hybrid method. This function only requires an initial estimate of the root.

from scipy.optimize import root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**q/(1+x**q) - b*x, 0.75)

scipy.optimize.root returns an object with several attributes. The attribute of interest to us is the root, represented by x, so we want pcrit.x. In this case, we get the correct value of 0.54454. You can play around with initial estimates to see how pcrit.x changes.

PythonScreenShot1

Not surprisingly, scipy.optimize.brentq uses Brent’s method and requires bounds as an input.

from scipy.optimize import brentq as root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**q/(1+x**q) - b*x, 0.01, 1.5)

This just returns the root itself, pcrit = 0.5445. Again, you can play around with the bounds to see how this estimate changes.

PythonScreenShot2

In C++, Dave again shows how this can be done in the function ‘main-lake.cpp’ provided in the Supplementary Material to OpenMORDM, linked from this page under the “Publications” section. On lines 165-168 he uses the Boost bisect tool to find the root of the function given on lines 112-114. I’ve copied the relevant sections of his code into the file ‘find_Pcrit.cpp’ below.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#include <boost/math/tools/roots.hpp>

namespace tools = boost::math::tools;
using namespace std;

double b, q, pcrit;

double root_function(double x) {
  return pow(x,q)/(1+pow(x,q)) - b*x;
}

bool root_termination(double min, double max) {
  return fabs(max - min) <= 0.000001;  // stop when the bracket width is within tolerance
}

int main(int argc, char* argv[])
{
  b = 0.42;
  q = 2.0;

  std::pair<double, double> result = tools::bisect(root_function, 0.01, 1.0, root_termination);
  pcrit = (result.first + result.second)/2;
  cout << pcrit << endl;
}

This yields the desired root of pcrit = 0.54454, but of course, changing the bounds may result in different estimates. In case you missed it, the take home message is to be careful about your initial estimate and bounds ;).

CPPscreenShot

 

From Writing NetCDF Files in C to Loading NetCDF Files in R

So much data from such little models…

It’s been my experience that even simple models can generate lots of data. If you’re a regular reader of this blog, I imagine you’ve had similar experiences as well. My most recent experience with this is the work I’ve done with the Dynamic Integrated Climate-Economy (DICE) model. I had inherited a port of the 2007 version of the model, which would print relevant output to the screen. During my initial runs with the model, I would simply redirect the output to ascii files for post-processing. I knew that eventually I would be adding all sorts of complexity to this model, ultimately leading to high-dimensional model output and rendering ascii files impractical. I needed a better way to handle all this data, so in updating the model to the 2013 version, I decided to incorporate support for netCDF file generation.

You can find details about the netCDF file format through Unidata (a University Corporation for Atmospheric Research [UCAR] Community Program) and through some of our previous blog posts (here, here, and here). What’s important to note here is that netCDF is a self-describing file format designed to manage high-dimensional hierarchical data sets.

I had become accustomed to netCDF files in my previous life as a meteorologist. Output from complex numerical weather prediction models would often come in netCDF format. While I had never needed to generate my own netCDF output files, I found it incredibly easy and convenient to process them in R (my preferred post-processing and visualization software). Trying to incorporate netCDF output support in my simple model seemed daunting at first, but after a few examples I found online and a little persistence, I had netCDF support incorporated into the DICE model.

The goal of this post is to guide you through the steps to generate and process a netCDF file. Some of our earlier posts go through a similar process using the Python and Matlab interfaces to the netCDF library. While I use R for post-processing, I generally use C/C++ for the modeling; thus I’ll step through generating a netCDF file in C and processing the generated netCDF file in R on a Linux machine.

Edit: I originally put a link to the following code at the bottom of this post. For convenience, here’s a link to the bitbucket repository that contains the code examples below.

Writing a netCDF file in C…

Confirm netCDF installation

First, be sure that netCDF is installed on your computing platform. Most scientific computing clusters will have the netCDF library already installed. If not, contact your system administrator to install the library as a module. If you would like to install it yourself, Unidata provides the source code and great documentation to step you through the process. The example I provide here isn’t all that complex, so any recent version (4.0+) should be able to handle this with no problem.

Setup and allocation

Include the header files

With the netCDF libraries installed, you can now begin to code netCDF support into your model. Again, I’ll be using C for this example. Begin by including the netCDF header file with your other include statements:

#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>

Setup an error handler

The netCDF library includes a nice way of handling possible errors from the various netCDF functions. I recommend writing a simple wrapper function that can take the returned values of the netCDF functions and produce the appropriate error message if necessary:

void ncError(int val)
{
  printf("Error: %s\n", nc_strerror(val));
  exit(2);
}

Generate some example data

Normally, your model will have generated important data at this point. For the sake of the example, let’s generate some data to put into a netCDF file:

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 10;
  int dim2_size = 5;
  int dim3_size = 20;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D arrays of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
      for(j = 0; j < dim2_size; j++) {
          for(k = 0; k < dim3_size; k++) {
              x[i][j][k] = (i+j+k) * 0.2;
              y[i][j][k] = (i+j+k) * 1.7;
              z[i][j][k] = (i+j+k) * 2.4;
          }
      }
  }

This generates three variables, each with three different size dimensions. Think of this, for example, as variables on a 3-D map with dimensions of [latitude, longitude, height]. In my modeling application, my dimensions were [uncertain state-of-the-world, BORG archive solution, time].

Allocate variables for IDs

Everything needed in creating a netCDF file depends on integer IDs, so the next step is to allocate variables for the netCDF file id, the dimension ids, and the variable ids:

// Allocate space for netCDF dimension ids
int dim1id, dim2id, dim3id;
  
// Allocate space for the netcdf file id
int ncid;
  
// Allocate space for the data variable ids
int xid, yid, zid;

Each one of these IDs will be returned through reference by the netCDF functions. While we’re at it, let’s make a variable to hold the return status of the netCDF function calls:

// Allocate return status variable
int retval;

Define the meta-data

Now we will start to build the netCDF file. This is a two-part process. The first part is defining the meta-data for the file and the second part is assigning the data.

Create an empty netCDF file

First, create the file:

// Setup the netcdf file
if((retval = nc_create("example.nc", NC_NETCDF4, &ncid))) { ncError(retval); }

Note that we store the return status of the function call in retval and test the return status for an error. If there’s an error, we pass retval to our error handler. The first parameter to the function call is the name of the netCDF file. The second parameter is a flag that determines the type of netCDF file. Here we use the latest-and-greatest type of NETCDF4, which includes the HDF5/zlib compression features. If you don’t need these features, or you need a version compatible with older versions of netCDF libraries, then use the default or 64-bit offset (NC_64BIT_OFFSET) versions. The third parameter is the netCDF integer ID used for assigning variables to this file.

 Add the dimensions

Now that we have a clean netCDF file to work with, let’s add the dimensions we’ll be using:

 // Define the dimensions in the netcdf file
 if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
 if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
 if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
 // Gather the dimids into an array for defining variables in the netcdf file
 int dimids[ndims];
 dimids[0] = dim1id;
 dimids[1] = dim2id;
 dimids[2] = dim3id;

Just as before, we catch and test the function return status for any errors. The function nc_def_dim() takes four parameters. First is the netCDF file ID returned when we created the file. The second parameter is the name of the dimension. Here we’re using “dimX_size” – you would want to use something descriptive of this dimension (i.e. latitude, time, solution, etc.). The third parameter is the size of this dimension (i.e. number of latitude, number of solutions, etc.). The last is the ID for this dimension, which will be used in the next step of assigning variables. Note that we create an array of the dimension IDs to use in the next step.

 Add the variables

The last step in defining the meta-data for the netCDF file is to add the variables:

// Define the netcdf variables
if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

The nc_def_var() function takes 6 parameters. These include (in order) the netCDF file ID, the variable name to be displayed in the file, the type of data the variable contains, the number of dimensions of the variable, the IDs for each of the dimensions, and the variable ID (which is returned through reference). The type of data in our example is NC_FLOAT, which is a 32-bit floating point. The netCDF documentation describes the full set of data types covered. The IDs for each dimension are passed as that combined array of dimension IDs we made earlier.

 Optional: Add variable attributes

This part is optional, but is incredibly useful and true to the spirit of making a netCDF file. When sharing a netCDF file, the person receiving the file should have all the information they need about the data within the file itself. This can be done by adding “attributes”. For example, let’s add a “units” attribute to each of the variables:

 // OPTIONAL: Give these variables units
 if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
 if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
 if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }

The function nc_put_att_text() puts a text-based attribute onto a variable. The function takes the netCDF ID, the variable ID, the name of the attribute, the length of the string of characters for the attribute, and the text associated with the attribute. In this case, we’re adding an attribute called “units”. Variable ‘x’ has units of “cm”, which has a length of 2. Variable ‘y’ has units of “degC”, which has a length of 4 (and so on). You can apply text-based attributes as shown here or numeric-based attributes using the appropriate nc_put_att_X() function (see the documentation for the full list of numeric attribute functions). You can also apply attributes to dimensions by using the appropriate dimension ID, or set a global attribute by passing NC_GLOBAL as the variable ID.

 End the meta-data definition portion

At this point, we’ve successfully created a netCDF file and defined the necessary meta-data. We can now end the meta-data portion:

 // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }

…and move on to the part 2 of the netCDF file creation process.

Populate the file with data

Put your data into the netCDF file

Here, all we do is put data into the variables we defined in the file:

 // Write the data to the file
 if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
 if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
 if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }

The function nc_put_var() takes three parameters: the netCDF file ID, the variable ID, and the memory address of the start of the multi-dimensional data array. At this point, the data will be written to the variable in the netCDF file. There is a way to write to the netCDF file in data chunks, which can help with memory management, and a way to use parallel I/O for writing data in parallel to the file, but I have no experience with that (yet). I refer those interested in these features to the netCDF documentation.

Finalize the netCDF file

That’s it! We’re done writing to the netCDF file. Time to close it completely:

 // Close the netcdf file
 if((retval = nc_close(ncid))) { ncError(retval); }

Compile and run the code

Let’s compile and run the code to generate the example netCDF file:

gcc -o netcdf_example netcdf_write_example.c -lnetcdf

Some common problems people run into here are not including the netCDF library flag at the end of the compilation call, not having the header files in the include-path, and/or not having the netCDF library in the library-path. Check your user environment to make sure the netCDF paths are included in your C_INCLUDE_PATH and LIBRARY_PATH:

env | grep -i netcdf

Once the code compiles, run it to generate the example netCDF file:

./netcdf_example

If everything goes according to plan, there should be a file called “example.nc” in the same directory as your compiled code. Let’s load this up in R for some post-processing.

 Reading a netCDF file in R…

Install and load the “ncdf4” package

To start using netCDF files in R, be sure to install the netCDF package “ncdf4”:

install.packages("ncdf4")
library(ncdf4)

Note that there’s also an “ncdf” package, which reads and writes the classic (default) and 64-bit offset versions of netCDF files. I recommend against using it, as the newer “ncdf4” package can handle the old file versions as well as the new netCDF4 version. As it turns out, the “ncdf” package has since been removed from the CRAN repository, which is just as well, since “ncdf4” obsoletes it.


Open the netCDF file

With the library installed and sourced, let’s open the example netCDF file we just created:

 nc <- nc_open("example.nc")

This stores an open file handle to the netCDF file.

View summary of netCDF file

Calling or printing the open file handle will produce a quick summary of the contents of the netCDF file:

 print(nc)

This summary produces the names of the available variables, the appropriate dimensions, and any global/dimension/variable attributes.

Extract variables from the netCDF file

To extract those variables, use the command:

x <- ncvar_get(nc, "x")
y <- ncvar_get(nc, "y")
z <- ncvar_get(nc, "z")

At this point, the data you extracted from the netCDF file are loaded into your R environment as 3-dimensional arrays. You can treat these the same as you would any multi-dimensional array of data (i.e. subsetting, plotting, etc.). Note that the dimensions are reported in the reverse of the order in which you defined them in C, as a quick dim() call shows:

dim(x)
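If the reversed ordering is inconvenient, a transpose puts the dimensions back in the order they were defined in C. A quick sketch:

x_reordered <- aperm(x, c(3, 2, 1))  #aperm() generalizes t() to n-dimensional arrays
dim(x_reordered)                     #now [dim1_size, dim2_size, dim3_size]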

 Close the netCDF file

When you’re done, close the netCDF file:

nc_close(nc)

And there you have it! Hopefully this step-by-step tutorial has helped you incorporate netCDF support into your project. The code I described here is available through bitbucket.

Happy computing!

~Greg

Visualization strategies for multidimensional data

This is the first part of a series of blog posts on multidimensional data visualization strategies.   The main objectives of this first part are:

  1. Show you how to expand plotting capabilities by modifying matplotlib source code.
  2. Generate a tailored 6-D Pareto front plot with completely customized legends.
  3. Provide a glimpse of a recently developed Pareto front video repository in R.

1. Expanding matplotlib capabilities

Keeping in mind that matplotlib is an open-source project developed in the contributors’ free time, there is no guarantee that features contributors propose will be added straightaway. In my case, I needed marker rotation capabilities in a 3D scatter plot. Luckily, someone had already figured out how to do this and started a pull request in the matplotlib github repository, but the change has not yet been merged. Since I couldn’t wait for that to happen, here’s the straightforward solution I found:

Here’s the link to the pull request that I am referring to.

First, I located where Matplotlib lives in my computer, the path in my case is:

C:/Python27/matplotlib

Then, I located the files that the contributor changed.  The files’ paths are circled in red in the following snippets of the pull request:

githubsnippet.png

github_snippet2.png

I located those files in my local matplotlib folder, which in my case are:

C:/Python27/matplotlib/axes/_axes.py

C:/Python27/matplotlib/collections.py

In the previous snippets, the lines of code that were added to the original script are highlighted in green and those that were removed are highlighted in red. Hence, to access the clean version, I clicked on the view button, selected the entire script, and copied and pasted it into my local matplotlib code. For this exercise I ended up changing only a couple of scripts: axes.py and collections.py.

NOTE: If you ever need to undertake this type of solution, make sure you paste the lines of code in the right places; do this part carefully. Also, it’s always a good idea to make backups of the original files in case something goes irreversibly wrong. Or you can always uninstall and reinstall matplotlib, no big deal.

2. Generate a tailored 6D Pareto front plot with customized legends.

Matplotlib allows visualization of 5 objectives quite easily, but scaling to 6 or more objectives can be a bit tricky. So, let’s walk through 6D plots in Matplotlib. We will learn how to make one of the following plots:

Pie Day  Plot:

pie.png

St. Patrick’s Day  Plot:

stpatricks.png

2.1. Required libraries:

The following are the only libraries that you’ll need. I import seaborn sometimes because it looks fancy, but it’s totally unnecessary in this case, which is why it is commented out.

import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
#import seaborn

2.2. Importing data:

The data file that I used consists of 6 space-separated columns; if your data has another delimiter you can just specify it, like so: data = np.loadtxt('sample_data.txt', delimiter=','). I am also multiplying the first five columns by -1 because I want to remove the negatives; this is specific to my data, and you may not need to do so.

data= np.loadtxt('sample_data.txt')

#Organizing the data by objectives
obj1 = data[:,0]*-1
obj2 = data[:,1]*-1
obj3 = data[:,2]*-1
obj4 = data[:,3]*-1
obj5 = data[:,4]*-1
obj6 = data[:,5]

2.3. Object-based plotting:

To allow more customization, we need to move to a more object-based way of making the plots; that is, storing elements of the plots in variables.

fig = plt.figure() # create a figure object
ax = fig.add_subplot(111, projection='3d') # create an axes object in the figure

2.4. Setting marker options:

Any mathtext symbol can be used as a marker. In order to use rotation to represent an additional objective, it’s preferable that the marker have a single axis of symmetry so that the rotation is distinguishable. Here are some marker options:

pie=r'$\pi$' #pie themed option
arrow = u'$\u2193$' # Arrows
clover=r'$\clubsuit$' #Saint Patrick's theme
heart=r'$\heartsuit$' # Valentine's theme
marker=pie #this is where you provide the marker option

More marker options can be found at: http://matplotlib.org/users/mathtext.html

2.5. Scatter 6D plot:

The first three objectives are plotted in a 3D scatter plot, on the x, y, and z axes respectively. The fourth objective is represented by color, the fifth by size, and the sixth by rotation. Note that the rotation is scaled in degrees. This is the step where I had to modify the matplotlib source code to enable the ‘angles’ option shown below. Also, you may need to scale the size objective to get the desired marker size in your plot. You can also plot the ideal point by adding a second scatter plot specifying the ideal values for each objective. Finally, we assign the size objective to “objs” and the rotation objective to “objr”; this will be useful later on when setting up the legends for these two objectives.

rot_angle=180 #rotation angle multiplier
scale=2000 #size objective multiplier
#Plotting 6 objectives:
im= ax.scatter(obj1, obj2, obj3, c=obj4, s= obj5*scale, marker=marker, angles=obj6*rot_angle, alpha=1, cmap=plt.cm.summer_r)
ax.scatter(1,1,0, marker=pie, c='seagreen', s=scale, alpha=1)
objs=obj5 #size objective
objr=obj6 #rotation objective

2.6. Main axis labels and limits:

This is extremely straightforward: you can set the x, y, and z labels and specify their limits as follows:

#Main axis labels:
ax.set_xlabel('Objective 1')
ax.set_ylabel('Objective 2')
ax.set_zlabel('Objective 3')
#Axis limits:
plt.xlim([0,1])
plt.ylim([0,1])
ax.set_zlim3d(0, 1)

2.7. Color bar options:

The colorbar limits and labels can also be specified, as shown in the code below. There are many colormap options in matplotlib; some of the most popular ones are jet, hsv, and spectral. As an example, if you want to change the colormap in the code shown in part 2.5, use cmap=plt.cm.hsv. To reverse the colormap, attach an ‘_r’, like so: cmap=plt.cm.hsv_r. There is also a color brewer package for the more artistic plotter.

# Set the color limits.. not necessary here, but good to know how.
im.set_clim(0.0, 1.0)

#Colorbar label:
cbar = plt.colorbar(im)
cbar.ax.set_ylabel('Objective 4')

2.8. Size and rotation legends:

This is where it gets interesting. The first couple of lines get the labels for the legend and choose which ones to display; this allows for much flexibility when creating legends. As you can see in the code below, you can show markers that correspond to the maximum and minimum objective values to orient the reader. You can set the spacing between lines in the legend, the title, whether you want to frame your legend or not, its location in the figure, etc. The plt.gca().add_artist(legend1) call near the end shows how to add more than one legend. There are many options for an entirely customized legend in the legend documentation, which you can explore further.

handles, labels = ax.get_legend_handles_labels()
display = (0,1,2)

#Code for size and rotation legends begins here for Objectives 5 and 6:
min_size=np.amin(objs)
max_size=np.amax(objs)

#Custom size legend (Objective 5, represented by marker size):
size_max = plt.Line2D((0,1),(0,0), color='k', marker=marker, markersize=max_size, linestyle='')
size_min = plt.Line2D((0,1),(0,0), color='k', marker=marker, markersize=min_size, linestyle='')
legend1= ax.legend([handle for i,handle in enumerate(handles) if i in display]+[size_max,size_min],
[label for i,label in enumerate(labels) if i in display]+["%.2f"%(np.amax(objs)), "%.2f"%(np.amin(objs))], labelspacing=1.5, title='Objective 5', loc=1, frameon=True, numpoints=1, markerscale=1)

#Custom rotation legend (Objective 6, represented by marker rotation):
rotation_max = plt.Line2D((0,1),(0,0), color='k', marker=r'$\Uparrow$', markersize=15, linestyle='')
rotation_min = plt.Line2D((0,1),(0,0), color='k', marker=r'$\Downarrow$', markersize=15, linestyle='')
ax.legend([handle for i,handle in enumerate(handles) if i in display]+[rotation_max,rotation_min],
[label for i,label in enumerate(labels) if i in display]+["%.2f"%(np.amax(objr)), "%.2f"%(np.amin(objr))], labelspacing=1.5, title='Objective 6', loc=2, frameon=True, numpoints=1, markerscale=1)

#add the first legend back so both legends appear
plt.gca().add_artist(legend1)

plt.show()

You can find the full code for the previous example in the following github repository:

https://github.com/JazminZatarain/Visualization-of-multidimensional-data/blob/master/paretoplot6d.py

3. Generate 6D Pareto front and runtime videos in R.

And last but not least, let me direct everyone to Calvin’s repository: https://github.com/calvinwhealton/ParetoFrontMovie, where you can find the paretoMovieFront6D.R script, which enables exploration of the evolution of a 6D Pareto front. It is an extremely flexible tool with around 50 customization options to adapt the video or plot to your visual needs; all you need is your runtime output, so check it out. I made the tiniest contribution to this repository, so I feel totally entitled to talk about it. Here is a snippet of the video:

video.png

Survival Function Plots in R


A survival function (aka survivor function or reliability function) is a function often used in risk management for visualizing system failure points. For example, it can be used to show the frequency of a coastal defense structure failure (such as a breach in a levee) in a future state of the world.

The function itself is quite simple. For a distribution of events, the survival function (SF) is 1-CDF, where CDF is the cumulative distribution function. If you’re deriving the distribution empirically, you can substitute the CDF with the cumulative frequency. The SF is often plotted on a semi-log scale, which makes tail-area analysis easier.
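Before getting to the full function, here is a bare-bones illustration of the idea on made-up data; the survival values run from 1 down to 1/n (rather than to 0) so the log scale stays finite:

x <- sort(rnorm(1000, 10, 2))               #made-up data
sf <- seq(1, 1/length(x), by=-1/length(x))  #empirical 1 - cumulative frequency
plot(x, sf, log="y", type="l")              #semi-log survival plot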

I’ve written some R code that creates a primitive survival function plot from a vector of data. Below is the function. (Note: you can find the code and an example of its usage on bitbucket: https://bitbucket.org/ggg121/r_survival_function.git)

plot.sf <- function(x, xlab=deparse(substitute(x)), left.tail=F,
  ylab=ifelse(left.tail, "SF [Cum. Freq.]", "SF  [1 - Cum. Freq.]"),
  make.plot=T, ...)
{
  num.x <- length(x)
  num.ytics <- floor(log10(num.x))
  sf <- seq(1,1/num.x,by=-1/num.x)
  
  if(left.tail){
    order.x <- order(x, decreasing=T)
    order.sf <- sf[order(order.x)]
    
  }  else {
    order.x <- order(x)
    order.sf <- sf[order(order.x)]
  }
  
  if(make.plot) {
    plot(x[order.x], sf, log="y", xlab=xlab, ylab=ylab, yaxt="n", ...)
    axis(2, at=10^(-num.ytics:0), 
         label=parse(text=paste("10^", -num.ytics:0, sep="")), las=1)
  }
  invisible(order.sf)
}

Download and source the code at the start of your R script and you’re good to go. The function, by default, creates a plot in the current plotting device and invisibly returns the survival function values corresponding to the vector of data provided. The left.tail parameter sets the focus on the left tail of the distribution (essentially plotting the CDF on a semi-log scale); by default, the function puts the focus on the right tail (left.tail = FALSE). The make.plot parameter allows you to toggle plotting of the survival function (the default is on, make.plot = TRUE); this is useful when you simply need the survival function values for further calculations or custom plots. Additional parameters are passed to the plot() function. Below is an example (which is also available in the repository).

# Source the function
source("plot_sf.r")

# Set the seed
set.seed(1234)

# Generate some data to use
my.norm <- rnorm(10000, 10, 2)
my.unif <- runif(10000)
my.weib <- rweibull(10000, 20, 5)
my.lnorm <- rlnorm(10000, 1, 0.5)


# Make the plots ----------------------
par(mfrow=c(2,2), mar=c(5,4,1,1)+0.1)

# Default plot settings
plot.sf(my.norm)

# Function wraps the standard "plot" function, so you can pass
# the standard "plot" parameters to the function
plot.sf(my.unif, type="l", lwd=2, col="blue", bty="l",
        ylab="Survival", xlab="Uniform Distribution")

# If the parameter "left.tail" is true, the plot turns into 
# a cumulative frequency plot (kind of like a CDF) that's plotted
# on a log scale.  This is good for when your data exhibits a left or
# negative skew.
plot.sf(my.weib, type="l", left.tail=T, xlab="Left-tailed Weibull Dist.")

# The function invisibly returns the survival function value.
lnorm.sf <- plot.sf(my.lnorm, type="l")
points(my.lnorm, lnorm.sf, col="red")
legend("topright", bty="n", 
       legend=c("Function Call", "Using returned values"), 
       lty=c(1,NA), pch=c(NA,1), col=c("black", "red") )

# The 'make.plot' parameter toggles plotting.
# Useful if you just want the survival function values.
norm.sf <- plot.sf(my.norm, make.plot=F)

And here’s the resulting figure from this example:

survival_function_plot_example

Now you can easily show, for example, the tail-area frequency of events. Below is a survival function plot of a normal distribution:

survival_plot_norm_didactic

For this example, we can imagine this as a distribution of flood heights (the x-axis would be flood height; note that a real distribution of flood heights would likely look drastically different from a normal distribution). With this visualization, we can easily depict the “1 in 10” or the “1 in 1,000” flood height by following the appropriate survival function value over to the corresponding flood height on the plot. Alternatively, you can determine the return period of a given flood height by following the flood height up to the plot and reading off the survival function value. Comparing multiple distributions on a single plot (think deep uncertainty) can produce interesting decision-relevant discussion about changes in return periods for a given event, or the range of possible events for a given return period.

I hope this post is useful. Survival function plots are incredibly versatile and informative…and I’ve only started to scratch the surface!