Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed's group at Cornell, who use computational tools such as Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques to solve water resources problems. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

Performing Experiments on HPC Systems

A lot of the work we do in the Reed lab involves running computational experiments on High Performance Computing (HPC) systems. These experiments often consist of performing multi-objective optimization to aid decision making in complex systems, a task that requires hundreds of thousands of simulation runs and may not be possible without the use of thousands of computing cores. This post will outline some best practices for performing experiments on HPC systems.

1. Have a Plan

By nature, experiments run on HPC systems consume large amounts of computational resources and generate large amounts of data. In order to stay organized, it's important to have a plan for both how the computational resources will be used and how the data will be managed.

Estimating your computational resources

Estimating the scale of your experiment is the first step to running on an HPC system. To make a reasonable estimate, you’ll need to gather the following pieces of information:

  • How long (in wall clock time) does a single model run take on your local machine?
  • How many function evaluations (for an MOEA run) or ensemble model runs will you need to perform?
  • How long do you have in wall clock time to run the experiment?

Using this information you can estimate the number of parallel processes that you will need to successfully run the experiment. Applications such as running the Borg MOEA are known as "embarrassingly parallel" and scale quite well with an increase in processors, especially for problems with long function evaluation times (see Hadka and Reed, 2013 for more info). However, many other applications scale poorly, so it's important to be aware of the parallel potential of your code. A helpful tip is to identify any inherently serial sections of the code which create bottlenecks to parallelization. Parallelizing tasks such as Monte Carlo runs and MOEA function evaluations will often result in higher efficiency than parallelizing the simulation model itself. For more resources on how to parallelize your code, see Bernardo's post from last year.
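
As a toy illustration of an embarrassingly parallel workload, the sketch below uses R's parallel package to spread independent Monte Carlo runs across cores; the simulate_model function and the run count are made-up placeholders rather than code from any of our models.

library(parallel)

# Placeholder for a single model run; in practice this would call your simulation model
simulate_model <- function(seed) {
  set.seed(seed)
  mean(rnorm(1e6))  # stand-in for an expensive model evaluation
}

n_runs  <- 100                         # number of independent Monte Carlo runs
n_cores <- max(1, detectCores() - 1)   # leave one core free

# Each run is independent, so runs can be spread across cores with no communication.
# mclapply forks the R session; on Windows, use parLapply with a PSOCK cluster instead.
results <- mclapply(1:n_runs, simulate_model, mc.cores = n_cores)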

Once you have an idea of the scale of your experiment, you'll need to estimate its computational expense. Each HPC resource has its own charging policy to track resource consumption. For example, XSEDE tracks charges in "service units," which are defined differently for each cluster. On the Stampede2 cluster, a service unit is defined as one node-hour of computing time, so if you run on 100 nodes for 10 hours, you spend 1,000 service units regardless of how many cores per node you utilize. On the Comet cluster, a service unit is charged by the core-hour, so if you run 100 nodes for 10 hours and each node utilizes 24 cores, you'll be charged 24,000 service units. Usually, the allocations you receive on each resource are scaled accordingly, so even though Comet looks more expensive, you likely have a much larger allocation to work with. I usually estimate the service units I need for an experiment and add another 20% as a factor of safety.
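
As a back-of-the-envelope illustration, the estimate itself is easy to script so you can revisit it when run times change. All of the numbers below are made up for the example, so swap in your own timings and cluster specifications.

# Hypothetical numbers for illustration only
runtime_per_eval_hr <- 5 / 60    # a single model run takes about 5 minutes locally
n_evals             <- 200000    # total function evaluations (or ensemble runs) needed
wall_clock_limit_hr <- 48        # wall clock time available for the experiment
cores_per_node      <- 24        # depends on the cluster

total_core_hours <- runtime_per_eval_hr * n_evals
cores_needed     <- ceiling(total_core_hours / wall_clock_limit_hr)
nodes_needed     <- ceiling(cores_needed / cores_per_node)

# Stampede2-style charging is by the node-hour; Comet-style charging is by the core-hour
node_hours <- nodes_needed * wall_clock_limit_hr
core_hours <- nodes_needed * cores_per_node * wall_clock_limit_hr

# Add a 20% factor of safety to whichever unit your allocation uses
node_hours * 1.2
core_hours * 1.2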

Data management:

Large experiments often create proportionately large amounts of data. Before you start, it's important to think about where this data will be stored and how it will be transferred to and from the remote system. Many clusters have limits on how much you can store on different drives, and breaking these limits can cause performance issues for the system. System administrators often don't take kindly to these performance issues, and in extreme cases, breaking the rules may result in suspension or removal from a cluster. It helps to create an informal data management plan for yourself that specifies:

  1. How will you transfer large amounts of data to and from the cluster? Tools such as Globus are helpful here.
  2. Where will you upload your code, and how will your files be structured?
  3. Where will you store data during your experimental runs? Clusters often have "scratch drives" with large or unlimited storage allocations, but these drives may be cleaned periodically, so they are not suitable for long-term storage.
  4. Where will you store data during post-processing? This may still be on the cluster if your post-processing is computationally intensive or your local machine can't handle the data size.
  5. Where will you store your experimental results and model data for publication and replication?

2. Test on your local machine

To make the most of your time on a cluster, it's essential that you do everything you can to ensure your code is functioning properly and efficiently before you launch your experiment. The biggest favor you can do for yourself is to thoroughly test your code on a local machine before porting to the HPC cluster. Before porting to a cluster, I always run the following four checks:

  1. Unit testing: the worst case scenario after an HPC run is to find out there was a major error in your code that invalidates your results. To mitigate this risk as much as possible, it's important to have careful quality control. One helpful tool for quality control is unit testing, which examines every function or module in your code and ensures it is working as expected. For an introduction to unit testing, see Bernardo's post on Python and C++; a minimal sketch is also shown after this list.
  2. Memory checking: in low-level code (think C, C++), memory leaks can be a silent problem that throws off your results or crashes your model runs. Sometimes memory leaks go undetected during small runs but add up and crash your system in large-scale experiments. To test for memory leaks, use tools such as Valgrind before uploading your code to any cluster. This post features a nice introduction to Valgrind with C++.
  3. Timing and profiling: profile your code to see which parts take the longest and eliminate unnecessary sections. If you can, optimize your code to avoid computationally intensive constructs such as nested loops. There are numerous tools for profiling low-level code, such as Callgrind and gprof. See this post for some tips on making your C/C++ code faster.
  4. Small MOEA tests: running small test runs (on the order of 1,000 NFE) will give you an opportunity to ensure that the model is properly interacting with the MOEA (i.e., objectives are connected properly, decision variables are being changed, etc.). Make sure you are collecting runtime information so you can evaluate algorithmic performance.
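
To make the unit testing idea concrete, here is a minimal sketch written in R with the testthat package, purely as an analogous illustration (the water_balance function is a toy placeholder); Bernardo's post covers the Python and C++ tooling in detail.

library(testthat)

# A toy function standing in for one piece of your model
water_balance <- function(inflow, outflow, storage) {
  storage + inflow - outflow
}

test_that("water_balance conserves mass", {
  expect_equal(water_balance(10, 4, 100), 106)
  expect_equal(water_balance(0, 0, 50), 50)
})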

3. Stay organized on the cluster

After you've fully tested your code, it's time to upload and port to the cluster. After transferring your code and data, you'll likely need to use the command line to navigate files and run your code. Familiarity with Linux command line tools, bash scripting, and command line editors such as vim can make your life much easier at this point. I've found some basic Linux training modules online to be very useful, in particular "Learning Linux Command Line" from LinkedIn Learning (formerly Lynda.com).

Once you've got your code properly organized and compiled, validate your timing estimate by running a small ensemble of simulation runs. Compare the timing on the cluster to your estimates from local tests and reconfigure your model runs if needed. If performing an experiment with an MOEA, run a small number of NFE on a development or debug node to confirm that the algorithm is running and properly parallelizing. Then run a single seed of the MOEA and perform runtime diagnostics to ensure things are working more or less as you expect. Finally, you're ready to run the full experiment.

I’d love to hear your thoughts and suggestions

These tips have been derived from my experiences running on HPC systems. If anyone else has tips or best practices that you find useful, I'd love to hear about them in the comments.

So What’s the Rationale Behind the Water Programming Blog?

By Patrick M. Reed

In general, I’ve made an effort to keep this blog focused on the technical topics that have helped my students tackle various issues big and small. It helps with collaborative learning and maintaining our exploration of new ideas.

This post, however, represents a bit of a departure from our normal posts, in response to some requests for a suggested reading guide for my Fall 2019 AGU Paul A. Witherspoon Lecture entitled "Conflict, Coordination, and Control in Water Resources Systems Confronting Change" (available on YouTube should you have interest).

The intent is to take a step back and zoom out a bit to get the bigger picture behind what we're doing as a research group and much of the original motivation for initiating the Water Programming Blog itself. So below, I'll provide a summary of papers related to the topics covered in the lecture, sequenced by the talk's focal points (with the approximate time markers noted below). Hope this provides some interesting reading for folks. My intent here is to keep this informal. The highlighted reading resources were helpful to me and are not meant to be a formal review of any form.

So let’s first highlight Paul Witherspoon himself, a truly exceptional scientist and leader (7 minute marker, slides 2-3).

  1. His biographical profile
  2. A summary of his legacy
  3. The LBL Memo Creating the Earth Sciences Division (an example of institutional change)

Next stop, do we understand coordination, control, and conflicting objectives in our institutionally complex river basins (10 minute marker, slides 6-8)? Some examples and a complex systems perspective.

  1. The NY Times Bomb Cyclone Example
  2. The ProPublica and The Texas Tribune interactive "Boomtown, Flood Town" (note this was written before Hurricane Harvey hit Houston)
  3. A Perspective on Interactions, Multiple Stressors, and Complex Systems (NCA4)

Does your scientific workflow define the scope of your hypotheses? Or do your hypotheses define how you need to advance your workflow (13 minute marker, slide 9)? How should we collaborate and network in our science?

  1. Dewey, J. (1958), Experience and Nature, Courier Corporation.
  2. Dewey, J. (1929), The quest for certainty: A study of the relation of knowledge and action.
  3. Hand, E. (2010), ‘Big science’ spurs collaborative trend: complicated projects mean that science is becoming ever more globalized–and Europe is leading the way, Nature, 463(7279), 282-283.
  4. Merali, Z. (2010), Error: Why Scientific Programming Does Not Compute, Nature, 467(October 14), 775-777.
  5. Cummings, J., and S. Kiesler (2007), Coordination costs and project outcomes in multi-university collaborations, Research Policy, 36, 1620-1634.
  6. National Research Council (2014), Convergence: facilitating transdisciplinary integration of life sciences, physical sciences, engineering, and beyond, National Academies Press.
  7. Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, and P. E. Bourne (2016), The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data, 3.
  8. Cash, D. W., W. C. Clark, F. Alcock, N. M. Dickson, N. Eckley, D. H. Guston, J. Jäger, and R. B. Mitchell (2003), Knowledge systems for sustainable development, Proceedings of the National Academy of Sciences, 100(14), 8086-8091.

Perspectives and background on Artificial Intelligence (15 minute marker, slides 10-16)

  1. Simon, H. A. (2019), The sciences of the artificial, MIT press.
  2. AI Knowledge Map: How To Classify AI Technologies
  3. Goldberg, D. E. (1989), Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, Reading, MA
  4. Coello Coello, C., G. B. Lamont, and D. A. Van Veldhuizen (2007), Evolutionary Algorithms for Solving Multi-Objective Problems, 2 ed., Springer, New York, NY.
  5. Hadka, D. M. (2013), Robust, Adaptable Many-Objective Optimization: The Foundations, Parallelization and Application of the Borg MOEA.

The Wicked Problems Debate (~22 minute marker, slides 17-19) and the emergence of post-normal science and decision making under deep uncertainty.

  1. Rittel, H., and M. Webber (1973), Dilemmas in a General Theory of Planning, Policy Sciences, 4, 155-169.
  2. Buchanan, R. (1992), Wicked problems in design thinking, Design Issues, 8(2), 5-21.
  3. Kwakkel, J. H., W. E. Walker, and M. Haasnoot (2016), Coping with the Wickedness of Public Policy Problems: Approaches for Decision Making under Deep Uncertainty, Journal of Water Resources Planning and Management, 01816001.
  4. Ravetz, J. R., and S. Funtowicz (1993), Science for the post-normal age, Futures, 25(7), 735-755.
  5. Turnpenny, J., M. Jones, and I. Lorenzoni (2011), Where now for post-normal science?: a critical review of its development, definitions, and uses, Science, Technology, & Human Values, 36(3), 287-306.
  6. Marchau, V. A., W. E. Walker, P. J. Bloemen, and S. W. Popper (2019), Decision making under deep uncertainty, Springer.
  7. Mitchell, M. (2009), Complexity: A guided tour, Oxford University Press.

Lastly, the Vietnam and North Carolina application examples.

  1. Quinn, J. D., P. M. Reed, M. Giuliani, and A. Castelletti (2019), What Is Controlling Our Control Rules? Opening the Black Box of Multireservoir Operating Policies Using Time-Varying Sensitivity Analysis, Water Resources Research, 55(7), 5962-5984.
  2. Quinn, J. D., P. M. Reed, M. Giuliani, A. Castelletti, J. W. Oyler, and R. E. Nicholas (2018), Exploring How Changing Monsoonal Dynamics and Human Pressures Challenge Multireservoir Management for Flood Protection, Hydropower Production, and Agricultural Water Supply, Water Resources Research, 54(7), 4638-4662.
  3. Trindade, B. C., P. M. Reed, and G. W. Characklis (2019), Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management, Advances in Water Resources, 134, 103442.
  4. Gold, D., P. M. Reed, B. Trindade, and G. Characklis (2019), Identifying Actionable Compromises: Navigating Multi-City Robustness Conflicts to Discover Cooperative Safe Operating Spaces for Regional Water Supply Portfolios, Water Resources Research, 55(11), 9024-9050, doi:10.1029/2019WR025462.

R Shiny, Part 2

In the last blog post, I went over how you can use R's Shiny package to create interactive plots like this one. Before you continue with this post, I recommend you go back and take a look at that. More information about R Shiny can be found in many online tutorials, such as here and here. In this post, I show you how to create interactive maps using the Shiny library.

First, let's load the Shiny and leaflet libraries (leaflet provides the leafletOutput and renderLeaflet functions we use below).

library(shiny)
library(leaflet)

You can set your working directory here. Keep in mind that if you're developing an online app, you can't set a working directory.

The following procedure loads the “openxlsx” library and reads our input file.

library(openxlsx)

You can read the input Excel file (which can be downloaded from here) using the following command.

crop_data<-read.xlsx("YOUR-ADDRESS/Input to Shiny-2.xlsx", "Final")

User Interface

Now we can define the user interface of our interactive plot (or online app). In this example, the UI part of the app is very simple, but you can make your own a lot prettier and more elegant.

ui <- fluidPage( 
  
# Here we define the title of our interactive plot:
  titlePanel("Acreage of Important Crops in California"),

# We use the fluid row command to have a better handle on 
# the way we define the positions of different elements. 
# But as I mentioned, I’ll keep the UI very simple in this example.

  fluidRow(

# First, let's design our selection panel.

                  column(10, offset = 4,
# In this example, we use a drop-down menu. 
# Users can choose between different crop types. 
# We use the "selectInput" command to do this.

                         selectInput("Crops", label = "Crops", choices = c("Almond", "Grape" , "Hay"))
                ),
                
# Here we define our main panel, which displays our map.
                column(10, offset = 1,
                       mainPanel(

# This command indicates that we are creating a map and defines the size of it.           
                  leafletOutput("mymap", width = "900", height = "800")
                )
                )

  ) #fluidRow
) # fluid page

Server

Remember that the server part of the code basically takes care of calculations and generates graphs, tables, maps, and so forth.

server<-function(input, output){
  
# First, you need to make sure that the maps package is 
# installed on your machine. Then you need to load it. 
# The install line is commented out in case you already have the package.

#install.packages("maps")
library(maps)

# The “map” command in the maps library allows you to generate 
# different types of maps.
# We use the following command to generate a map of the State 
# of California that includes its counties.

    mapStates = map('county', 'california', fill = TRUE, col = palette())

# Here we create the output map that we will send back to the user interface.
    output$mymap<- renderLeaflet({
  
# Here we read the column corresponding to the 
# crop that the user has selected.
# Note that there are two input arguments when we define our server function, 
# "input" and "output." These are used to transfer information between the server and UI components.

# The "input$Crops" part of the code basically takes the choice the 
# user made in UI part of the model and makes it available to the server part.
      
    values<-as.numeric(unlist(crop_data[paste(input$Crops, "_Acres", sep ="" )]))
    
# This part of the code defines the bins and a sequence of colors. Then we assign colors to bins.
  bins <- c(0, 1000, 5000, 10000, 20000, 50000, 100000, 150000, 200000, 250000, Inf)
  palette_crop <- colorBin("BrBG", domain = values, bins = bins)


# This part of the program defines the labels that will be 
# displayed when the user hovers the pointer over different parts of the map.
# We use the "sprintf" command to combine the county name and acreage into a label string, 
# and htmltools::HTML tells leaflet to render those strings as HTML on the map.
  
  labels <- sprintf(
    "<strong>%s</strong><br/>%g Acres </sup>",
    crop_data$NAME, values # crop_data$Hay_Acres
  ) %>% lapply(htmltools::HTML)
  
# We use the "leaflet" command from the leaflet package to generate a background map, 
# and then we put our California map on top of it. 
# The following commands can be used to install and load the leaflet library.

#install.packages("leaflet")
#library(leaflet)
  
    leaflet(data = mapStates) %>% addTiles( ) %>%
# We use the "addPolygons" command from the leaflet package to add more 
# information to the map. In this case, we add our color-coded counties to the plot. 

      addPolygons(
        fillColor = ~palette_crop(as.numeric(unlist(crop_data[paste(input$Crops, "_Acres", sep ="" )]))),
        weight = 2,
        opacity = 1,
        color = "white",
        dashArray = "3",
        fillOpacity = 0.7,
# Here we define how the polygons should be highlighted.
        highlight = highlightOptions(
          weight = 5,
          color = "#666",
          dashArray = "",
          fillOpacity = 0.7,
          bringToFront = TRUE),
        label = labels,
        labelOptions = labelOptions(
          style = list("font-weight" = "normal", padding = "3px 8px"),
          textsize = "15px",
          direction = "auto"))%>%
# Now we can add legends to the map.
    addLegend(pal = palette_crop, values = values, opacity = 0.7, title = NULL,
              position = "bottomleft")
  })
  
}

Finally, the following command ties together the UI and server functions and creates a Shiny app.

shinyApp(ui, server)

You can find the code and the online app produced in this example here.

R Shiny – Part 1

In this blog post, I explain a helpful R package that you can use to create interactive plots and web applications. It's called shiny, and it can also be used to create websites with interactive user interfaces. The interactive plots can contain graphs, maps, tables, and other details you need on your site. Here, I cover some basics of R Shiny and some guidelines for creating interactive plots with it. In the next two posts, I'll provide more details on how the package can be used to create simple web apps.

Shiny apps have two main functions: ui.R and server.R. These functions can be in two separate R files or a single one. The ui.R function is used to create the graphical user interface; I provide more details below. The server.R function is where calculations are done and graphs are generated. In other words, ui.R is equivalent to front-end development and server.R to the back-end part of the code. Before I get into the details, I'll note that there are several nice instruction manuals out there that you can refer to for more information. Probably the best is Shiny's original manual, available here and here.

To showcase some of the things you can create with R Shiny, I downloaded CRAN's package download logs for November 1, 2019 here. This file has more than 3 million lines, corresponding to all the CRAN packages that were downloaded that day. I post-processed the data so that the final file lists the packages by the number of times they were downloaded that day. For example, this graph shows the first thirty packages. You can download the final post-processed file from here.
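
For reference, the post-processing step itself only takes a few lines of R. The sketch below assumes the raw log has a package column (as the CRAN download logs do) and writes the two-column pkg_by_freq.txt file that the server code later in this post reads; the input file name is a placeholder.

# Sketch of the post-processing step (the raw log file name is a placeholder)
raw_log <- read.csv("2019-11-01.csv.gz")   # read.csv can read the gzipped log directly

# Count downloads per package and rename the columns to match the Shiny app
pkg_freq <- as.data.frame(table(raw_log$package))
names(pkg_freq) <- c("pkg_name", "pkg_count")

# Sort by download count and save for later use
pkg_freq <- pkg_freq[order(pkg_freq$pkg_count, decreasing = TRUE), ]
write.table(pkg_freq, "pkg_by_freq.txt", row.names = FALSE, quote = FALSE)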

User Interface

The user interface part of our script divides the page into several components and defines the characteristics of each. Let's say we want a sidebar that provides options for the user. We can specify the layout of our web app, the size of each component, the details of any slider bars or dropdown menus we want, and so forth. The following is a user interface script I created, in which I added some comments to explain what each part of the code does.

library(shiny)
library(ggpubr)
library(ggplot2)

setwd("~/blog posts/R shiny/") 

# UI 

ui <- fluidPage(  
  # The fluidPage command translates R shiny codes to HTML format 
  
  
  
  titlePanel("Most Downloaded R Packages"), # Here we can define the main title of our web app
  
  
  fluidRow( 
    
    # The fluidRow command gives us the flexibility to define our layout;
    # the column and offset arguments are two popular ways to position elements.
    # There are also other predefined layouts available in R Shiny,
    # such as the sidebar layout, vertical layout, and split layout.
    
    
    # Here we set the width and position of our object in the web app
    column(10, offset = 4,
           
    # Users can have different types of widgets 
    # in R-Shiny, such as dropdown menus and date range inputs. 
    # One of the popular ways to set a number
    # is through a slider widget, the following two lines 
    # define two slider bars to set different values
    # The sliderInputs command allows us to define the range 
    # and default value of our slider widget
           
           sliderInput(inputId = "NumLibs1",
                       label = "1- Number of libraries in PDF plot:", min = 30, max = 1000, value = 30),
           
           
           sliderInput(inputId = "NumLibs2",
                       label = "2- Number of libraries in bar plot:", min = 3, max = 30, value = 10)
    ),
    
    # The mainPanel command defines the size and layout 
    # structure of our main panel where we show our actual plots
    column(12, 
           mainPanel(
             
    # Creates a plot that will be placed in the main panel
             plotOutput(outputId = "plot",width = "150%", height = "450")
             
           )
    )
    
  ) 
)

Server

The server performs the calculations that need to be done behind the scenes, generates the graphics, defines the properties of tables, and so on. The following is the server code I developed, with comments explaining all the components of my plot.


# Server 

server<-function(input, output){ 
  # Here we define a server function that does 
  # behind the scene calculations and generates our plot
  
  input_data<- read.table("pkg_by_freq.txt", header = T)
  # Reads the input file from our working directory
  
  reorderd_freq<-input_data[order(input_data$pkg_count, decreasing = T),]
  # This command sorts our input data based on number of downloads 
  # in that particular day (11/01/2019)
  
  output$plot <- renderPlot({
    # This part renders our output plot
    
    max_numb<-input$NumLibs1
    num_pop_libs<-input$NumLibs2
    # Here our code receives the number that will 
    # be set by users through the slider widget
    
    p1<-ggplot(reorderd_freq[6:max_numb,], aes(x=pkg_count)) +geom_density(fill="lightblue")+
      labs(title = "1- PDF of Numuer of Downloads of R Packages",  x="Number of Downloads", y="") +
      theme_bw() +theme(axis.text.x  = element_text(size = rel(1.8)) )
    
    reorderd_freq$pkg_name <- reorder(reorderd_freq$pkg_name, reorderd_freq$pkg_count)
    p2<-ggplot(reorderd_freq[1:num_pop_libs,])+ geom_bar(aes(x=pkg_name, y=pkg_count), stat="identity", fill="purple1") +
      labs(title = "2- Most Downloaded R Packages", y="Number of Downloads", x="Package Name") +
      coord_flip() +theme_bw() +theme(axis.text.y  = element_text(size = rel(1.4)) )
    # Now we use the ggplot2 package to generate two figures: a PDF plot (geom_density) and a bar plot (geom_bar)
    # Note that we use the slider inputs to change the characteristics of our plots
    
    ggarrange(p1, p2)
    # Finally we combine our two plots using the ggarrange function from the ggpubr package
  })
  
}

One thing to keep in mind is that if you have two separate files for ui.R and server.R, you always have to save them in the same folder.

When you're done with your ui.R and server.R, you can either use the RStudio Run button or the runApp() command to combine all the components and create your final output.

# This command connects the UI code with the server code and 
# generates our final output 
shinyApp(ui, server)
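
If you do keep ui.R and server.R as separate files, a minimal way to launch the app is to point runApp() at the folder that contains them; the folder name below is just a placeholder.

library(shiny)
runApp("my_shiny_app")  # folder containing ui.R and server.R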

And here is the final interactive figure that you will produce:

Also, I created this simple interactive webpage to show the results of my R code example (download the R code from here). The left graph shows the probability density function of the different R libraries downloaded on November 1, 2019. Because more than 15,000 R libraries were downloaded that day, the graph allows you to see the distribution of the libraries with the highest download rates. The slider lets you change the number of libraries included in the graph. The right plot shows the most downloaded libraries, and its slider lets you include an arbitrary number of the most popular libraries in the figure.

In the next couple of posts, I will give more examples of R shiny’s capabilities and ways to make websites using it.

ggplot (Part 2)

This is the second part of the ggplot introduction. In this blog post, I am going to go over how you can make a decent density plot in ggplot. Density plots are essentially smoothed versions of histograms: they show the distribution of your data as a probability density estimated with the kernel density estimation procedure. For example, when we have a regional data set, it is important to look at the distribution of our data across the region instead of just considering the regional average.

In our example (download the data set from here), we are going to visualize the regional distribution of simulated average winter wheat yield over the 30 years from 1981 to 2010. The "ID" column in the data set represents one grid cell in the region, and there are 1,812 grid cells in total. For each grid cell, the average historical yield and the standard deviation of yield over the 30 years are given. First, we need to load the library; then, following the general code structure of "ggplot(dataframe, aes(x, y, fill))," we map the x-axis to the yield column. The y-axis will be calculated and added through "geom_density()". Then, we can add a color, title, and labels and customize the background.

example1<- read.csv("(your directory)/example_1.csv")
library(ggplot2)   
ggplot(example1, aes(x=example1$period_ave_Y))+ 
geom_density(fill="blue")+
 theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

Now, we want to know how the 30-year standard deviation of yield for all the grid cells in the region can be mapped onto this density plot.

We can add another column (name it “SD_class”) to the data set and classify the standard deviations. The maximum and minimum standard deviations among all the grid cells are the following.

max(example1$period_sd_Y)
# [1] 3.605131
min(example1$period_sd_Y)
# [1] 0.8645882

For example, I want to see this plot categorized by standard deviations between 0.8 and 1.5, 1.5 and 2.5, and 2.5 up to the maximum value. Here, I am writing a simple loop that goes over each row and checks the standard deviation value for that row (corresponding to one grid cell in the region); I fill the newly added column ("SD_class") with the correct class specified in the "if" statements.

example1$SD_class<- NA
for (i in 1:nrow(example1)){
  if(example1[i,2]>0.8 && example1[i,2]<= 1.5) {example1[i,4]<- c("0.8-1.5")}
  if(example1[i,2]>1.5 && example1[i,2]<= 2.5) {example1[i,4]<- c("1.5-2.5")}
  if(example1[i,2]>2.5) {example1[i,4]<- c("2.5-3.6")}
}
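
As a side note, the same classification can be done in a single call to R's cut() function, which bins a numeric vector by break points. This is just an alternative sketch to the loop above; cut() returns a factor rather than character values, which ggplot handles the same way.

# Equivalent binning with cut(); the intervals (0.8,1.5], (1.5,2.5], (2.5,3.7]
# match the if statements above
example1$SD_class <- cut(example1$period_sd_Y,
                         breaks = c(0.8, 1.5, 2.5, 3.7),
                         labels = c("0.8-1.5", "1.5-2.5", "2.5-3.6"))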

Now, we just need to add "fill" to the aesthetics section of the code, specify the column with the classifications, and add "alpha" to make the colors transparent so we can see the shapes of the distributions and whether they overlap.

ggplot(example1, aes(x=example1$period_ave_Y,fill =SD_class))+
  geom_density(alpha=0.4)+
  theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
        legend.text=element_text(size=13),legend.title=element_text(size=14))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

We can also use the “facet_grid()” option, like the plot in Part (1), and specify the column with classification to show each of these classes in a separate panel.

ggplot(example1, aes(x=example1$period_ave_Y,fill =SD_class))+
  geom_density(alpha=0.4)+facet_grid(example1$SD_class ~ .)+
  theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
        legend.text=element_text(size=13),legend.title=element_text(size=14))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

Another interesting thing we can explore is where different percentiles of our data set fall on the density plot. For this, we need to obtain the density values (the y-axis on the plot) for the percentiles that we are interested in (for example, the 10th, 25th, 50th, 75th, and 90th). We also need to find the actual yield value corresponding to each percentile:

quantiles_yield <- quantile(example1$period_ave_Y, prob=c(0.1, 0.25, 0.5, 0.75, 0.9))
#     10%      25%      50%      75%      90% 
#  4.229513 5.055070 5.582192 5.939071 6.186014

Now, we are going to estimate the density value for each of the yields at the 10th, 25th, 50th, 75th, and 90th percentiles.

df <- approxfun(density(example1$period_ave_Y))

The above function will give us the approximate density value for each point (yield) in which we are interested—in our case, yields for the above percentiles:

df(c(quantiles_yield))
#[1] 0.1176976 0.3267841 0.6129621 0.6615790 0.4345247

Now, we can add several vertical segments to the density plot that show where each percentile is located on this graph. The limits of these segments on the y-axis are based on the density values for each percentile that we got above. Also, note that I used those values to adjust the positions of the labels for the segments.

ggplot()+ 
      geom_density(aes(x=example1$period_ave_Y),fill="blue",alpha=0.4) + 
    geom_segment(aes(x=quantiles_yield, y=0, xend =quantiles_yield,
                     yend= df(c(quantiles_yield))),size=1,colour =c("red","green","blue","purple","orange"),linetype='dashed')+
      theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
            axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
            legend.text=element_text(size=13),legend.title=element_text(size=14))+
      labs(title = paste("Density Plot of Regional Average Historical Yield (30 years) and Percentiles"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")+
    annotate("text", x=4.229513, y=0.15, label=paste("10%"),size=5)+
    annotate("text", x=5.055070, y=0.36, label=paste("25%"),size=5)+
    annotate("text", x=5.582192, y=0.65, label=paste("50%"),size=5)+
    annotate("text", x=5.939071, y=0.7, label=paste("75%"),size=5)+
    annotate("text", x=6.186014, y=0.47, label=paste("90%"),size=5) 

Hydro Packages in R: HydroGOF

In this blog post, I will go over a very helpful hydrologic package in R that can make your hydro-life much easier. The package is called HydroGOF, and it can be used to make different types of plots, including mean monthly, annual, and seasonal plots for streamflow, rainfall, temperature, and other environmental variables. You can also use HydroGOF to compare your simulated flow to observed flow and calculate various performance metrics such as Nash-Sutcliffe efficiency. Indeed, the GOF part of HydroGOF stands for “goodness of fit.” More information about HydroGOF and its applications for hydrologists can be found here. Also, you can find a more comprehensive list of hydrologic R packages from this water programming blog post.

1- Library and Data Preparation

HydroGOF accepts R data frames and R zoo objects. If you are not familiar with R's zoo objects, you can find more information here. In this tutorial, I use HydroGOF's test case streamflow data, which are in zoo format. Here is how you can install and load zoo and HydroGOF.

install.packages("zoo")
library(zoo)
install.packages("hydroGOF ")
library(hydroGOF)

After you load the package, you need to activate your streamflow data. This is how you do so.

# Activate HydroGOF's streamflow data
data(EgaEnEstellaQts) 

Now, let’s take a look at the first few lines of our streamflow data.

head(EgaEnEstellaQts)

Note that the first column is the date and the second column is the streamflow data in m3/s. Also, keep in mind that you can use zoo to manipulate the temporal regime of your data; for example, you can convert your daily data to monthly or annual series.
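
For example, here is a minimal sketch of collapsing the daily series into monthly and annual means using zoo's aggregate() method (as.yearmon ships with the zoo package):

# Monthly mean flows: aggregate() applies as.yearmon to the daily index
monthly_q <- aggregate(EgaEnEstellaQts, as.yearmon, mean)

# Annual mean flows: collapse the index to the calendar year instead
annual_q <- aggregate(EgaEnEstellaQts, function(d) as.numeric(format(d, "%Y")), mean)

head(monthly_q)
head(annual_q)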

2- Streamflow Plots

Now, let’s use HydroGOF to visualize our observed streamflow data. You can use the following commands to generate some nice figures that will help you explore the overall regime of the streamflow data.

obs<-EgaEnEstellaQts

hydroplot(x = obs,var.type = "FLOW", var.unit = "m3/s", ptype = "ts+boxplot", FUN=mean)
# Note that "hydroplot" command is very flexible and there are many
# options that users can add or remove

3- Generate Simulated Dataset

For this tutorial, I have written the following function, which uses the observed streamflow to generate a very simple estimate of daily streamflow. Basically, the function takes the observed data and calculates the average flow for each day of the year. Then, the function repeats the one-year series as many times as needed, which in our case is ten times, to match the length of the observed flow.

simple_predictor<-function(obs){
  # This function generates a very simple prediction of streamflow 
  # based on observed streamflow inputs

  # Day of year (1-366) for each observation; use obs so the function
  # works for any input series, not just EgaEnEstellaQts
  DOY<-data.frame(matrix(ncol =1, nrow = length(obs)))
  
  for (i_day in 1:length(obs)){
    DOY[i_day,]=as.numeric(strftime(index(obs[i_day]), format = "%j"))
  }
 
# Create a 365 day timeseries of average daily streamflow.  
  m_inflow_obs<-as.numeric(aggregate(obs[,1], by=list(DOY[,1]), mean)) 
  
  simplest_predictor<-data.frame(matrix(ncol=3, nrow =length(obs )))
  names(simplest_predictor)<-c("Date", "Observed", "Predicted")
  simplest_predictor[,1]=index(obs)
  
  simplest_predictor[,2]=coredata(obs)
  
  for (i_d in 1:length(obs)){
   # Iterates average daily flow for entire simulation period 
    simplest_predictor[i_d,3]=m_inflow_obs[DOY[i_d,1]]
  }
  # Convert to zoo format
  dt_z<-read.zoo(simplest_predictor, format="%Y-%m-%d") 
  
  return(dt_z)
}

After loading the function, you can use the following to create your combined observed and simulated data frame.

# Here we just call the function
obs_sim<-simple_predictor(obs)

4- Hydrologic Error Metrics Using HydroGOF

There are twenty error metrics in HydroGOF—for example, mean error (ME), mean absolute error (MAE), root mean square error (RMSE), normalized root mean square error (NRMSE), percent bias (PBIAS), ratio of standard deviations (Rsd), and Nash-Sutcliffe efficiency (NSE). You can find more information about them here. You can use the following commands to calculate specific error metrics.

# Nash-Sutcliffe Efficiency
NSE(sim=obs_sim$Predicted, obs=obs_sim$Observed) 
# Root Mean Squared Error
rmse(sim=obs_sim$Predicted, obs=obs_sim$Observed) 
# Mean Error
me(sim=obs_sim$Predicted, obs=obs_sim$Observed) 

You can also use this cool command to see all of the available error metrics in HydroGOF.

gof(sim=obs_sim$Predicted, obs=obs_sim$Observed)

5- Visualizing Observed and Simulated Streamflow

Here is the most interesting part: you can plot observed and simulated on the same graph and add all error metrics to the plot.

ggof(sim=obs_sim$Predicted, obs=obs_sim$Observed, ftype="dm", gofs = c("NSE", "rNSE", "ME", "MSE",  "d", "RMSE", "PBIAS"), FUN=mean)
# You should explore different options that you can add to this figure. 
# For example you can choose which error metrics you want to display, etc

Beginner’s Guide to TensorFlow and Keras

This post is meant to provide a very introductory overview of TensorFlow and Keras and concludes with an example of how to use these libraries to implement a very basic neural network.

TensorFlow

At its core, TensorFlow is a free, open-source symbolic math library that expresses calculations in terms of dataflow graphs composed of nodes and tensors. TensorFlow is particularly adept at handling the operations required to train neural networks and thus has become a popular choice for machine learning applications [1]. All nodes and tensors, as well as the API, are Python-based. However, the actual mathematical operations are carried out efficiently using high-performance C++ binaries [2]. TensorFlow was originally developed by the Google Brain Team for internal use but has since been released for public use. The library has a reputation for being on the more technical side and less intuitive to new users, so based on user feedback from the initial versions, the TensorFlow team decided to add Keras to their core library in 2017.

Keras

Keras is an open-source neural network Python library that has become a popular API to run on top of TensorFlow and other deep learning frameworks. Keras is especially useful for beginners in machine learning because it offers a user-friendly, intuitive, and quick way to build models. In particular, users tend to be most interested in quickly implementing neural networks. In Keras, neural network models are composed of standalone modules of neural network layers, cost functions, optimizers, activation functions, and regularizers that can be built separately and combined [3].

Example: Predicting the age of abalone

In the example below, I will implement a very basic neural network to illustrate the main components of Keras. The problem we are interested in solving is predicting the age of abalone (sea snails) given a variety of physical characteristics. Traditionally, the age of abalone is calculated by cutting the shell, staining it, and counting the number of rings in the shell, so the goal is to see if less time-consuming and less intrusive measurements, such as the diameter, length, and gender, can be used to predict the age of the snail. This dataset is one of the most popular regression datasets from the UC Irvine Machine Learning Repository. The full set of attributes and the dataset can be found here.

Step 1: Assessing and Processing Data

The first step to approaching this (and any machine learning) problem is to assess the quality and quantity of information provided in the dataset. We first import the relevant basic libraries. More specific libraries will be imported above their respective functions for illustrative purposes. Then we import the dataset, which has 4177 observations of 9 features and no null values.


#Import relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#Import dataset
dataset = pd.read_csv('Abalone_Categorical.csv')
#Print number of observations and features
print('This dataset has {} observations with {} features.'.format(dataset.shape[0], dataset.shape[1]))
#Check for null values
dataset.info()

Note that the dataset is composed of numerical attributes except for the "Gender" column, which is categorical. Neural networks cannot work with categorical data directly, so the gender must be converted to a numerical value. Categorical variables can be assigned a number, such as Male=1, Female=2, and Infant=3, but this is a fairly naïve approach that assumes some sort of ordinal relationship. This scale implies to the neural network that a male is more similar to a female than to an infant, which may not be the case. Instead, we use one-hot encoding to represent the variables as binary vectors.

#One hot encoding for categorical gender variable

Gender = dataset.pop('Gender')

dataset['M'] = (Gender == 'M')*1.0
dataset['F'] = (Gender == 'F')*1.0
dataset['I'] = (Gender == 'I')*1.0

Note that we now have 10 features because Male, Female, and Infant are represented with the following binary columns:

Gender   M   F   I
Male     1   0   0
Female   0   1   0
Infant   0   0   1
It would be beneficial at this point to look at the overall statistics and distributions of each feature and to check for multicollinearity, but further investigation into these topics will not be covered in this post. Also note the difference in the ranges of each feature. It is generally good practice to normalize features that have different units, since different units likely correspond to different scales for each feature. Very large input values can lead to large weights which, in turn, can make the network unstable. It is up to the user to decide on the normalization technique that is appropriate for their dataset, with the condition that the output value that is returned also falls in the range of the chosen activation function. Here, we separate the dataset into a "data" portion and a "label" portion and use the MinMaxScaler from Scikit-learn which will, by default, transform the data into the range (0,1).

#Reorder Columns

dataset = dataset[['Length', 'Diameter ', 'Height', 'Whole Weight', 'Schucked Weight','Viscera Weight ','Shell Weight ','M','F','I','Rings']]

#Separate input data and labels
X=dataset.iloc[:,0:10]
y=dataset.iloc[:,10].values

#Normalize the data using the min-max scalar

from sklearn.preprocessing import MinMaxScaler
scalar= MinMaxScaler()
X= scalar.fit_transform(X)
y= y.reshape(-1,1)
y=scalar.fit_transform(y)

We then split the data into a training set that is composed of 80% of the dataset. The remaining 20% is the testing set.

#Split data into training and testing 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Step 2: Building and Training the Model

Then we build the Keras structure. The core of this structure is the model, which is of the “Sequential” form. This is the simplest style of model and is composed of a linear stack of layers.

#Build Keras Model

import keras
from keras import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=10, input_dim=10,activation='relu'))
model.add(Dense(units=1,activation='linear'))
model.compile(optimizer='adam', loss='mean_squared_error',  metrics=['mae','mse'])

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

history=model.fit(X_train,y_train,batch_size=5, validation_split = 0.2, callbacks=[early_stop], epochs=100)

# Model summary for number of parameters use in the algorithm
model.summary()

Stacking layers is executed with model.add(). We stack a dense layer of 10 nodes that has 10 inputs feeding into the layer. The activation function chosen for this layer is a ReLU (Rectified Linear Unit), a popular choice due to better gradient propagation and sparser activation than, for example, a sigmoidal function. A final output layer with a linear activation function is stacked on top to simply return the model output. This network architecture was picked rather arbitrarily but can be tuned to achieve better performance. The model is compiled using model.compile(). Here, we specify the type of optimizer, in this case the Adam optimizer, which has become a popular alternative to the more traditional stochastic gradient descent. A mean-squared-error loss function is specified, and the metrics reported will be mean squared error and mean absolute error.

Then we call model.fit() to iterate training in batches. The batch size corresponds to the number of training instances that are processed before the model is updated. The number of epochs is the number of complete passes through the dataset. The more times the model sees the dataset, the more chances it has to learn the patterns, but too many epochs can also lead to overfitting. The number of epochs appropriate for this case is unknown, so I implement a validation set that is 20% of my current training set. I set a large value of 100 epochs and add an early stopping criterion that stops training when the validation score stops improving, which helps prevent overfitting.

Training and validation error can be plotted as a function of epochs [4].

def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Square Error [$Rings^2$]')
  plt.plot(hist['epoch'], hist['mean_squared_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_squared_error'],
           label = 'Val Error')
  plt.legend()
  plt.show()

plot_history(history)

This results in the following figure:

Figure 2: Training and validation error as a function of epochs

Step 3: Prediction and Assessment of Model

Once our model is trained, we can use it to predict the age of the abalone in the test set. Once the values are predicted, they must be scaled back to the original units, which is done using the inverse_transform function from Scikit-learn.


#Predict testing labels

y_pred= model.predict(X_test)

#undo normalization 

y_pred_transformed=scalar.inverse_transform(y_pred.reshape(-1,1))
y_test_transformed=scalar.inverse_transform(y_test)

Then the predicted and actual ages are plotted along with a 1-to-1 line to visualize the performance of the model. The R-squared value and RMSE are calculated as 0.554 and 2.20 rings, respectively.


#visualize performance
fig, ax = plt.subplots()
ax.scatter(y_test_transformed, y_pred_transformed)
ax.plot([y_test_transformed.min(), y_test_transformed.max()], [y_test_transformed.min(), y_test_transformed.max()], 'k--', lw=4)
ax.set_xlabel('Measured (Rings)')
ax.set_ylabel('Predicted (Rings)')
plt.show()

#Calculate RMSE and R^2
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(y_test_transformed, y_pred_transformed))

from sklearn.metrics import r2_score
r_squared=r2_score(y_test_transformed,y_pred_transformed)

Figure 1: Predicted vs. actual (measured) ages

The performance is not ideal, but the results are appropriate given that this dataset is notoriously hard to use for prediction without other relevant information such as weather or location. Ultimately, it is up to the user to decide if these are significant/acceptable values; otherwise, the neural network hyperparameters can be further fine-tuned, or more input data and more features can be added to the dataset to try to improve performance.

Sources:

[1] Dean, Jeff; Monga, Rajat; et al. (November 9, 2015). "TensorFlow: Large-scale machine learning on heterogeneous systems" (PDF). TensorFlow.org, Google Research.

[2] Yegulalp, Serdar. "What Is TensorFlow? The Machine Learning Library Explained." InfoWorld, 18 June 2019, http://www.infoworld.com/article/3278008/what-is-tensorflow-the-machine-learning-library-explained.html.

[3] "Keras: The Python Deep Learning Library." Keras Documentation, keras.io/.

[4] "Basic Regression: Predict Fuel Efficiency." TensorFlow Core Tutorials, http://www.tensorflow.org/tutorials/keras/regression.