Job scheduling on HPC resources

Architecture of an HPC Cluster

Modern High Performance Computing (HPC) resources are usually composed of a cluster of computing nodes that allow users to parallelize tasks and greatly reduce the time it takes to perform complex operations. A node is usually defined as a discrete unit of a computer system that runs its own instance of an operating system. Modern nodes have multiple chips, often known as Central Processing Units or CPUs, each of which contains multiple cores, each capable of processing a separate stream of instructions (such as a single Monte Carlo run). An example cluster configuration is shown in Figure 1.


Figure 1. An example cluster configuration

To efficiently make use of a cluster’s computational resources, it is essential to allow multiple users to use the resource at one time and to have an efficient and equitable way of allocating and scheduling computing resources on a cluster. This role is filled by job scheduling software. The scheduling software is accessed via a shell script called in the command line. A scheduling script does not actually run any code; rather, it provides a set of instructions for the cluster specifying what code to run and how the cluster should run it. Instructions called from a scheduling script may include but are not limited to:

  • What code you would like the cluster to run
  • How you would like to parallelize your code (i.e., MPI, OpenMP, etc.)
  • How many nodes you would like to run on
  • How many cores per node you would like to use (normally the maximum allowable per node)
  • Where you would like error and output files to be saved
  • Whether to set up email notifications about the status of your job

This post will highlight two commonly used job scheduling languages, PBS and SLURM, and detail some simple example scripts for using them.


The Portable Batch System (PBS) was originally developed by NASA in the early 1990s [1] to facilitate access to computing resources.  The intellectual property associated with the software is now owned by Altair Engineering. PBS is a fully open source system and the source code can be found here. PBS is the job scheduler we use for the Cube Cluster here at Cornell.

An annotated PBS submission script called “” that runs a C++ code called “triangleSimulation.cpp” on 128 cores can be found below:

#PBS -l nodes=8:ppn=16    # how many nodes, how many cores per node (ppn)
#PBS -l walltime=5:00:00  # what is the maximum walltime for this job
#PBS -N SimpleScript      # Give the job this name.
#PBS -M # email address for notifications
#PBS -j oe                # combine error and output file
#PBS -o outputfolder/output.out # name output file

cd $PBS_O_WORKDIR # change working directory to current folder

#module load openmpi/intel # load MPI (Intel implementation)
time mpirun ./triangleSimulation -m batch -r 1000 -s 1 -c 5 -b 3

To submit this PBS script via the command line, one would type:

qsub <name of submission script>
Other helpful PBS commands for UNIX can be found here. For more on PBS flags and options, see this detailed post from 2012 and for more example PBS submission scripts see Jon Herman’s Github repository here.
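For quick reference, a few standard PBS commands that are commonly used alongside a submission script (exact options can vary slightly between PBS versions; the file name below is a placeholder):

qsub submission_script.pbs  # submit a job
qstat -u <username>         # check the status of your jobs
qdel <job_id>               # cancel a job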


A second common job scheduler is known as SLURM. SLURM stands for “Simple Linux Utility for Resource Management” and is the scheduler used on many XSEDE resources such as Stampede2 and Comet.

An example SLURM submission script named “” that runs “triangleSimulation.cpp” on 128 cores can be found below:

#SBATCH --nodes=8             # specify number of nodes
#SBATCH --ntasks-per-node=16  # specify number of cores per node
#SBATCH --export=ALL
#SBATCH -t 5:00:00            # set max wallclock time
#SBATCH --job-name="triangle" # name your job
#SBATCH --output="outputfolder/output.out" # name output file

#ibrun is the command for MPI
ibrun -v ./triangleSimulation -m batch -r 1000 -s 1 -c 5 -b 3 -p 2841

To submit this SLURM script from the command line, one would type:

sbatch <name of submission script>

The Cornell Center for Advanced Computing has an excellent SLURM training module within the introduction to Stampede2 workshop that goes into detail on how to most effectively make use of SLURM. More examples of SLURM submission scripts can be found on Jon Herman’s Github. Billy also wrote a blog post last year about debugging with SLURM.
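As with PBS, a few standard SLURM commands are worth keeping at hand (behavior can vary slightly between clusters; the file name below is a placeholder):

sbatch submission_script.sh  # submit a job
squeue -u <username>         # check the status of your jobs
scancel <job_id>             # cancel a job
sinfo                        # view available partitions and nodes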



Creating Dendrograms in R

A dendrogram is an effective way of visualizing results from hierarchical clustering. The purpose of this post is to show how to make a basic dendrogram in R and illustrate the ways in which one can add colors to dendrogram labels and branches to help identify key clustering drivers. Making dendrograms in R is quite straightforward. However, customizing a dendrogram is not so straightforward, so this post shows some tricks that I learned that should help expedite the process!

First and foremost, your data must be in an appropriate form for hierarchical clustering to be conducted. Table 1 shows an example of how your data can be set up. Four different spatial temperatures projected by CMIP5 models are shown along with various attributes that could be potential driving forces behind clustering: the institution the model comes from, the RCP (radiative forcing scenario) used in the model, and the initial conditions with which the model was run.

Table 1: Model Attributes

At this point, it is helpful to add the model names as the row names (shown in the leftmost column) of your data frame, otherwise the dendrogram function will use the row number as a label on the dendrogram which can make it hard to interpret the clustering results.

Next, create a distance matrix, which will be composed of Euclidean distances between pairs of model projections. This is what clustering will be based on. We first create a new data frame composed of just the temperature values (shown below) by removing columns from the Model Attributes table.


Table 2: Temperature Projections

The following code can be used to create Table 2 from the original table and then the distance matrix.

#Create a new data frame with just temperature values

just_temperature=Model_Attributes[ -c(1:4) ]

#Create a distance matrix (base R's dist(); object names here are illustrative)

distance_matrix = dist(just_temperature, method = "euclidean")
Now, one can make the clustering diagram. Here I chose to use complete linkage clustering as the agglomeration method and wanted my dendrogram to be horizontal.

#Perform clustering (complete linkage) and convert the result to a dendrogram object

complete_linkage_cluster = as.dendrogram(hclust(distance_matrix, method = "complete"))

#Adjust dimensions of dendrogram so that it fits in plotting window

par(mar = c(3, 4, 1, 15)) #Dendrogram parameters
plot(complete_linkage_cluster,horiz =TRUE)

And that’s it! Here is the most basic dendrogram.


Figure 1: Dendrogram

Now for customization. You will first need to install the “dendextend” library in R.
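If you do not already have it, the package can be installed and loaded in the usual way (labels_colors() and set(), used below, come from dendextend):

install.packages("dendextend")
library(dendextend)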

We have 11 institutions that the models can come from, and we want to visualize whether institution has some impact on clustering by assigning a color to each label. Here we use the rainbow color palette to assign each institution a color and then replot the dendrogram.


#Create a vector of colors with one color for each institution

col = rainbow(11) #one color per institution (11 here); the Institution column is used as an index below
#Add colors to the ordered dendrogram
labels_colors(complete_linkage_cluster)= col[Model_Attributes$Institution][order.dendrogram(complete_linkage_cluster)]

#Replot the dendrogram

par(mar=c(3,4,1,15)) #Dendrogram parameters
plot(complete_linkage_cluster,horiz =TRUE)


Figure 2: Dendrogram with Colored Labels

Now suppose we wanted to change the branch colors to show what RCP each model was run with. Here, we assign a color from the rainbow palette to each of the four RCPs and add it to the dendrogram.


col = rainbow(4) #now one color per RCP (4 RCPs)
col_branches = col[Model_Attributes$RCP][order.dendrogram(complete_linkage_cluster)]
colored_dendrogram = complete_linkage_cluster %>% set("branches_col", col_branches) #one possible way to color branches with dendextend's set()

plot(colored_dendrogram,horiz =TRUE)


Figure 3: Dendrogram with Colored Labels and Colored Branches

Now finally, we can change the node shapes to reflect the initial condition. There are 10 total initial conditions, so we’re going to use the first 10 standard pch (plot character) elements to represent the individual nodes.

nodePar = list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.7, col = "black") #node parameters

#Map each model's initial condition to one of the first 10 pch symbols
#(the column name Initial_Conditions is illustrative; use the name in your own data)
nodes = c(1:10)[Model_Attributes$Initial_Conditions][order.dendrogram(complete_linkage_cluster)]

dend1 = colored_dendrogram %>% set("leaves_pch", c(nodes))

plot(dend1,horiz =TRUE)


Figure 4: Dendrogram with Colored Labels, Colored Branches, and Node Shapes

And that’s how you customize a dendrogram in R!

Making Valgrind Easy

Some of this blog’s readers and authors (most notably, Joe Kasprzyk) read the title of this post and thought “wait, there already is a post about Valgrind on this blog.” And you are right, so in this blog post I will build on the legacy Joe has left us in his post about Valgrind and get into the details of how to use its basic functionalities to get your code right.

Common mistakes when coding in C++

Suppose we have the following code:

#include <stdio.h>

int main() {
    // you wouldn't do this if the size was always 5, but this makes the argument clear.
    int *var = new int[5];
    int n = 5;
    int m;

    if (m > n) {
        printf("Got into if statement.\n");
        for (int i = 0; i < 6; ++i) {
            var[i] = i;
        }
    }

    printf("var[5] equals %d\n", var[n]);
}

Saving the code above in a file called test.cpp, compiling it with g++ to create an executable called "test," and running it with "./test" will return the following output:

bernardoct@DESKTOP-J6145HK ~
$ g++ test.cpp -o test

bernardoct@DESKTOP-J6145HK ~
$ ./test
Got into if statement.
var[5] equals 5

Great, it ran and did not crash (for such a simple code, gcc's -Wall flag would have issued a warning saying m was not initialized, but in more complex code such a warning may not be issued). However, it would be great if this code had crashed, because that would make us look into it and figure out that it actually has three problems:

  1. We did not assign a value to variable m (it was created but not initialized), so how did the code determine that m was greater than n and get into the code inside the if statement?
  2. The pointer array var was created with length 5, meaning its elements are numbered 0 to 4. If the for loop runs from 0 to 5 but element 5 does not exist, how did the code fill it in with the value of variable i when i was 5 in the loop? From the printf statement that returned 5, we know var[5] equals 5.
  3. The pointer array var was not destroyed after the code no longer needed it. This is not necessarily a problem in this case, but if this were a function called over and over within a model, there is a chance the RAM would be filled with these seemingly inoffensive pointer arrays and the computer would freeze (or the node, if running on a cluster, would possibly crash and have to be rebooted). A version of the code with all three problems fixed is sketched below.
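Here is one possible corrected version (a sketch, not the only way to fix it; the value given to m is arbitrary and only for illustration):

#include <stdio.h>

int main() {
    int *var = new int[5];
    int n = 5;
    int m = 10; // 1. m is now initialized before being used in the comparison

    if (m > n) {
        printf("Got into if statement.\n");
        for (int i = 0; i < 5; ++i) { // 2. loop only over the 5 elements that exist (indices 0 to 4)
            var[i] = i;
        }
        printf("var[4] equals %d\n", var[4]); // read an element that actually exists
    }

    delete[] var; // 3. free the memory allocated with new[]
    return 0;
}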

Given that C++ will not necessarily crash even in the presence of such errors, one way of making sure your code is clean is by running it through Valgrind. However, most people who have used Valgrind on a model with a few hundred or thousand lines of code have gotten discouraged by its possibly long and cryptic-looking output. Do not let this intimidate you, because the output is actually fairly easy to read once you either learn what to look for or use Valkyrie, a graphical user interface for Valgrind.

Generating and interpreting Valgrind’s output

The first thing that needs to be done for Valgrind to give you a meaningful output is to re-compile your code with the -O0 and -g flags, the former to prevent the compiler from modifying your code to make it more efficient but unintelligible to Valgrind (or to debuggers), and the latter so that Valgrind (and debuggers) can pinpoint the lines of code where issues happen and originate. Therefore, the code should be compiled as shown below:

bernardoct@DESKTOP-J6145HK ~
$ g++ -O0 -g test.cpp -o test

Now it is time to run your code with Valgrind to perform some memory checking. Valgrind itself takes flags that dictate the type of analysis to be performed. Here we are interested in checking for memory misuse (instead of profiling, checking for thread safety, etc.), so the first flag (not required, but good for keeping things explicit) should be --tool=memcheck. Now that we have specified that we want Valgrind to run a memory check, we should specify that we want it to look in detail for memory leaks and tell us where the errors happen and originate, which can be done by passing the flags --leak-check=full and --track-origins=yes. This way, the complete call to run Valgrind on our test program is:

bernardoct@DESKTOP-J6145HK ~
$ valgrind --tool=memcheck --leak-check=full --track-origins=yes ./test

Important: Beware that your code will take orders of magnitude longer to run with Valgrind than it would otherwise. This means that you should run something as small as possible but still representative; e.g., instead of running your stochastic model with 1,000 realizations and a simulation time of 50 years, consider running 2 realizations simulating 2 years, so that Valgrind still analyzes the year-long simulation and the transitions between realizations and years. Also, if running your code on a cluster, load the valgrind module with module load valgrind-xyz in your submission script and replace the call to your model in the submission script with the valgrind call above; you can find the exact name of the Valgrind module by running module avail on the terminal. If running valgrind with a code that uses MPI, use mpirun valgrind ./mycode -flags.
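As a rough sketch of what that could look like in a PBS submission script (the module name valgrind-xyz is a placeholder; check module avail for the actual name on your cluster):

#PBS -l nodes=1:ppn=1
#PBS -l walltime=2:00:00

cd $PBS_O_WORKDIR
module load valgrind-xyz  # placeholder module name
valgrind --tool=memcheck --leak-check=full --track-origins=yes ./test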

When called, Valgrind will instrument the execution of our test program and, based on the collected information, will print the following on the screen:

==385== Memcheck, a memory error detector
==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==385== Command: ./test
==385== Conditional jump or move depends on uninitialised value(s)
==385==    at 0x4006A9: main (test.cpp:9)
==385==  Uninitialised value was created by a stack allocation
==385==    at 0x400686: main (test.cpp:3)
Got into if statement.
==385== Invalid write of size 4
==385==    at 0x4006D9: main (test.cpp:12)
==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)
==385== Invalid read of size 4
==385==    at 0x4006F5: main (test.cpp:16)
==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)
var[5] equals 5
==385==     in use at exit: 20 bytes in 1 blocks
==385==   total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated
==385== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)
==385==    definitely lost: 20 bytes in 1 blocks
==385==    indirectly lost: 0 bytes in 0 blocks
==385==      possibly lost: 0 bytes in 0 blocks
==385==    still reachable: 0 bytes in 0 blocks
==385==         suppressed: 0 bytes in 0 blocks
==385== For counts of detected and suppressed errors, rerun with: -v
==385== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)

Seeing Valgrind’s output being five times as long as the test code itself can be somewhat disheartening, but the information contained in it is really useful. The first block of the output is the header; it will always be printed so that you know the version of Valgrind you have been using, the command it ran for your code, and so on. In our example, the header is:

==385== Memcheck, a memory error detector
==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==385== Command: ./test

After that, Valgrind reports the errors it found during the execution of your code. Errors are always reported as a description of the error in good old English, followed by where it happens in your code. Let’s look at the first error found by Valgrind:

==385== Conditional jump or move depends on uninitialised value(s)
==385==    at 0x4006A9: main (test.cpp:9)
==385==  Uninitialised value was created by a stack allocation
==385==    at 0x400686: main (test.cpp:3)

This tells us that there is an if statement (conditional statement) on line 9 of test.cpp in which at least one side of the logical test involves an uninitialized variable. As pointed out by Valgrind, line 9 of test.cpp has our problematic if statement, which compares the initialized variable n to the uninitialized variable m, which will hold whatever was last left in that memory address by the computer.

The second error block is the following:

==385== Invalid write of size 4
==385==    at 0x4006D9: main (test.cpp:12)
==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)

This means that your code is writing something to a location in memory that it did not allocate for its use. This block says that the illegal write, so to speak, happened on line 12 of test.cpp through a variable created on line 5 of test.cpp using the new[] operator. These lines correspond to var[i] = i; and to int *var = new int[5];. With this, we learn that either var was created too short on line 5 of test.cpp or that the for loop that assigns values to var goes one or more steps too far.

Similarly, the next block tells us that our printf statement used to print the value of var[5] on the screen has read past the amount of memory that was allocated to var in its declaration on line 5 of test.cpp, as shown below:

==385== Invalid read of size 4
==385==    at 0x4006F5: main (test.cpp:16)
==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)

The last thing Valgrind will report is the information about memory leaks, which are accounted for when the program is done running. The output about memory leaks for our example is:

==409==     in use at exit: 20 bytes in 1 blocks
==409==   total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated
==409== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1
==409==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==409==    by 0x400697: main (test.cpp:5)
==409==    definitely lost: 20 bytes in 1 blocks
==409==    indirectly lost: 0 bytes in 0 blocks
==409==      possibly lost: 0 bytes in 0 blocks
==409==    still reachable: 0 bytes in 0 blocks
==409==         suppressed: 0 bytes in 0 blocks

The important points to take away from this last block are that:

  1. there were 20 bytes of memory leaked, meaning that if this were a function in your code, every time it was run it would leave 20 bytes of garbage sitting in the RAM. This may not sound like a big deal, but imagine if your code left 1 MB of garbage in the RAM for each of the 100,000 times a function is called. With that, there go 100 GB of RAM and everything else you were doing on your computer at the time, because the computer will likely freeze and have to go through a hard reset.
  2. the memory you allocated and did not free was allocated on line 5 of test.cpp, where you used the operator new[] to allocate the integer pointer array.

It is important to notice here that if we increased the amount of memory allocated by the new[] operator on line 5 to hold 6 integers instead of 5, the last two errors (invalid read and invalid write) would disappear. This means that if you run your code with Valgrind and see hundreds of errors, chances are that modifying a few lines of code will get rid of most of them.

Valkyrie — a graphical user interface for Valgrind

Another way of going through Valgrind’s output is by using Valkyrie (now installed on the login node of Reed’s cluster, The Cube). If you are analyzing your code from your own computer with a Linux terminal (it does not work with Cygwin, but you can install a native Ubuntu terminal on Windows 10 by following instructions posted here) and do not have Valkyrie installed yet, you can install it by running the following on your terminal:

bernardoct@DESKTOP-J6145HK ~
$ sudo apt-get install valkyrie

Valkyrie works by reading an XML file exported by Valgrind containing the information about the errors it found. To export this file, you need to pass the flags --xml=yes and --xml-file=valgrind_output.xml (or whatever name you want to give the file) to Valgrind, which makes the call to Valgrind become:

bernardoct@DESKTOP-J6145HK ~
$ valgrind --tool=memcheck --leak-check=full --track-origins=yes --xml=yes --xml-file=valgrind_output.xml ./test

Now you should have a file called “valgrind_output.xml” in the directory you are calling Valgrind from. To open it with Valkyrie, first open Valkyrie by typing valkyrie on your terminal (if on Windows 10, you need to have Xming installed and running, which can be done by following the instructions at the end of this post). If on a cluster, besides having Xming open you also have to have ssh’ed into the cluster with the -X flag (e.g. by running ssh -X from either Cygwin or a native Linux terminal). After opening Valkyrie, click on the green folder button and select the XML file, as in the screenshot below.


After opening the XML file generated by Valgrind, Valkyrie should look like the screenshot below:

Now you can begin from a collapsed list of errors and unfold each error to see its details. Keep in mind that Valkyrie is not your only option of GUI for Valgrind, as IDEs like JetBrains’ CLion and QTCreator come integrated with Valgrind. Now go check your code!

PS: Thanks to folks on Reddit for the comments which helped improve this post.

What is resilience in water resources systems?

Ah, resilience. Everybody talks about it, everybody has an idea (and sometimes a precise one) about what it means, but nobody seems to agree on what it actually means. This post provides an overview of that topic for water resources systems: why people talk about it, how we can understand it, why it matters for adaptation, and why it is more of an exciting research topic than a buzzword. I prepared a video that answers all four questions in just over 4 minutes. Here it is.

An additional question would be: what are the connections with other topics abundantly described in this blog, such as robustness? Clues to the answer are in the video (a.k.a. you should definitely watch it), and the short answer is that resilience can be thought of as robustness, or as recovery when we are not robust. Academics and practitioners alike tend to focus on one of these aspects depending on the characteristics of the problem(s) they focus on.

That short answer can look a bit too easy, but it works whatever it is we are assessing robustness or resilience to: events, long-term changes, surprises (the state of the world is not what we thought), etc.

In practice that means that when we are concerned with resilience, we are concerned with three problems at once: 1) robustness within a safe operating space that we define according to our preferences, perceptions, the state of the system, etc., 2) a recovery problem outside of that safe space, and 3) where is our limit? Everything that has been done (in the Reed group and beyond) on robustness still applies when we operate our system in our safe space! Part of my work now is eliciting what is similar and what is different in the recovery problem, compared with the robustness problem. Another part is understanding what shapes the limit of a safe operating space.

More to follow on that soon… in the meantime, whoever is interested in how resilience amounts to solving two separate problems depending on the state of our system, can look up that work from my PhD thesis (yes, it even contains an application of the shallow lake problem!).

Geeky appendix

How did I make this video? I used the free and open-source Kazam software to record myself, screen and voice. Then I edited the video a bit with OpenShot (you have to leave your main screen to stop the recording with Kazam, so I had to edit that out), an equally free and open-source piece of software. Both were easy to use and I would recommend them if you want to make this kind of video. One last thing: it is definitely interesting to record oneself talking! It is not easy to hold off the hesitations and quirks in your speech for 4 minutes…

A completely non-exhaustive list of tutorial resources for scientific computing

This is a short blog post to put together a list of resources with tutorials (similar to what’s usually found on this blog) for various programming languages. It is by no means exhaustive, so please comment if you feel there’s an important one I left out.


Kinda new blog with Matlab tutorials on numerical methods

One of the MathWorks blogs, great tutorials.

Matlab tutorials from MathWorks

More Matlab tutorials, a lot of material on many topics


Not updated very frequently, but good data analysis and visualization tutorials are posted

Updated regularly, some great data visualization and analysis tutorials. The tutorials come with videos. I really like this site.

Updated regularly (about once a month) with python tutorials and general python coding insights. I really like the writing style of this author.

Tutorials on using SciPy

C and C++:

C and C++ programming tutorials, tips and tricks

Not really updated anymore but some good basic tutorials are listed

Hadn’t been updated in a while, but it looks like it’s been picked up again. Good for general C++ programming and good practice.

C++ tutorials


This is a great general resource not devoted to a particular language. They cover topics in data science, machine learning and programming in general. Some great tutorials for Python and R.

Mathematical programming problems to help with learning any language

Free programming books repository

Reddits (some of the bigger ones):

/r/matlab (General on matlab, community provides help with coding)

/r/programming (General on programming)

/r/learnprogramming (Community provides help with debugging questions)

/r/python (General on python, community provides help with coding)

/r/learnpython (Community provides help with python questions, smaller than /r/python)

/r/cpp (General on C++)

/r/cpp_questions (Community provides help with C++ questions)

I’ve also recently made /r/sci_comp which has very little activity for the moment, but the aim is to create a community with general resources on coding for scientific applications.


Installing a Native Ubuntu Terminal on Windows 10

Modern clusters and cloud platforms require users to log in through SSH from Unix terminals in order to access their functionality. Such terminals also give their users access to a wide range of commands and programs, such as awk and Vim, which are not available on Windows. Linux and Mac users have such terminals installed by default, but Windows users normally have to install Cygwin or use a browser-based terminal, both of which have limited capabilities, such as the inability to install new packages.

Fortunately, Microsoft has established a partnership with Canonical (the company behind Ubuntu) that brought the Windows Subsystem for Linux to Windows 10, allowing users to install Ubuntu’s terminal on Windows through official means without the need for third-party compatibility layers. Using Ubuntu’s terminal on Windows has the advantages of being able to use apt-get and dpkg to install new packages, which is not possible with Cygwin, and of running Python and C/C++ codes faster. Here are the steps to install the Ubuntu terminal on Windows 10:

  1. In Windows Settings, click on “Update & Security.” There, click on “For Developers” near the bottom of the left pane and turn on the option “Developer Mode.”
  2. In Windows Settings, click on “Apps” -> “Programs and Features” (right pane) -> “Turn Windows features on or off” (left pane) and check “Windows Subsystem for Linux.”
  3. Restart your computer.
  4. Open Windows PowerShell as administrator, type the following line and press enter:
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
  5. Open the Microsoft Store (Windows’ app store), look for “Ubuntu Terminal,” and install it.
  6. Now you should have it installed and a shortcut on your quick-start bar.
  7. Open the Ubuntu Terminal and type:
sudo apt-get update

In order to install programs such as the Intel compiler and profiler (free for students), pip, Vim, GNUPlot or the most recent version of GCC, just type:

sudo apt-get install program_to_be_installed

If the package you installed has graphical components, such as GNUPlot and Python/Matplotlib, you will need to install a program on Windows to display the graphical components from the Ubuntu terminal. One such option is Xming. To use Xming, follow these steps:

    1. Install Xming from here.
    2. Run it. Click “next” until you can click on “Finish.” This process will have to be repeated every time you open the terminal.
    3. Open the Ubuntu terminal.
    4. Type the following and press enter:
echo "export DISPLAY=localhost:0.0" >> .bashrc
    5. Close and re-open the terminal.
    6. In order to make sure you can run graphical applications, install some basic X11 test applications with the command below, then launch one of them (e.g. xeyes) to check that a window opens:
sudo apt-get install x11-apps

Preparing Data for a Time Series Analysis

This semester, I had the opportunity to take Dr. Scott Steinschneider’s new class, Hydrologic Engineering in a Changing Climate, that is offered here at Cornell. In this class, we covered time series analysis, extreme value modeling, and trend tests. I chose to do a final project which focused on using a time series approach to forecast electricity demand in California. As I worked through my project, it became apparent to me that data are rarely in the form where a time series model can be applied directly. Consequently, multiple transformations are usually necessary before a model can be fit to the data set. In this post, I outline some of the inherent characteristics of data that might warrant a transformation as well as the steps that can be taken to address these problems.

Given a data set that you would like to fit any Box-Jenkins model to, you should ask yourself the following two questions:

  1. Is normality a reasonable assumption for the residuals?
  2. Are the data stationary?


Normality can be checked before fitting the model because if the original data are not normal, then there is a good chance that the residuals won’t be either. If you fit a histogram to the data and it looks like Figure 1, you probably need to apply some form of normalizing transformation.


Figure 1: Histogram of Monthly Flow

Two traditional transformations that you can try are a log transform or a Box-Cox transform, shown in the following two equations, where x_t is the original data point.

Log Transform:

y_t = log(x_t)

Box-Cox Transform:

y_t = (x_t^λ − 1) / λ
Sometimes a log transformation can be too drastic and skew the data the opposite way. The Box-Cox transform is effectively a less intense transformation that one can try if the log transform is not suitable. Note that when λ=0, the Box-Cox transform reduces to a simple log transform.

The powerTransform function in the R package car can be used to find a λ that will maximize normality.
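As a minimal sketch of how this might look, assuming your data are stored in a numeric vector called flows (powerTransform, coef, and bcPower all come from the car package):

library(car)

#Estimate the Box-Cox lambda that brings the data closest to normality
pt = powerTransform(flows)
lambda = coef(pt)

#Apply the Box-Cox transform with the estimated lambda
flows_bc = bcPower(flows, lambda)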


For a data set to exhibit stationarity, the following three principles must be true for us to be confident that our model will represent our data well:

For some lag term, s:

  1. E[x_t] = E[x_{t+s}] (the mean of the data set does not change with time)
  2. Var[x_t] = Var[x_{t+s}] (the variance of the data set does not change with time)
  3. Cov[x_t, x_{t+s}] = γ (the covariance between data points at lag s is some constant value, γ)

Outlined below are some of the characteristics of a data set that can cause a violation of one or more of these principles.


Seasonality in data can exist if a time series pattern repeats over a fixed and known period. Figure 2 shows monthly inflow into the Schoharie Creek Reservoir. Periodicity is apparent, but it isn’t until we look at the autocorrelation function (ACF) of the data, shown in Figure 3, that we see that there is a clear repetition occurring every 12 months.


Figure 2: Monthly Inflow for the Schoharie Creek


Figure 3: ACF of Monthly Inflow

One effective way to get rid of this monthly seasonality is to use the following de-seasonalizing equation:

z_t = (x_t − x̄_m) / s_m

The seasonality is removed from each data point by subtracting the corresponding monthly mean (x̄_m) and dividing by that month’s standard deviation (s_m). This equation can also be used to account for daily or yearly seasonality.
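A minimal sketch of this in R, assuming flow is a vector of monthly observations and month is a vector of month indices (1 through 12) of the same length:

#Monthly means and standard deviations (one value per calendar month)
monthly_mean = tapply(flow, month, mean)
monthly_sd = tapply(flow, month, sd)

#Subtract each observation's monthly mean and divide by that month's standard deviation
deseasonalized = (flow - monthly_mean[month]) / monthly_sd[month]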

Differencing is another way to address seasonality in data. A seasonal difference is the difference between an observation and the corresponding observation from the previous year:

y_t = x_t − x_{t−m}

where m = 12 for monthly data, m = 4 for quarterly data, and so on [1].


A trend, shown in the first panel of Figure 4, is a clear violation of the first requirement for stationarity. There are a couple of options that one can implement to deal with trends: differencing and model fitting.


Figure 4: De-trending process [1]

From the above figures, it is clear that differencing can be used to account for seasonality but can also be used to dampen a trend. A first difference is performed by subtracting from each observation the value at the previous time step:

y_t = x_t − x_{t−1}

If the transformed data are plotted and still show a trend, a second difference can be applied:

y''_t = y_t − y_{t−1} = x_t − 2x_{t−1} + x_{t−2}
It is important to note the distinction between seasonal and first differences. Seasonal differencing is the difference from one year to the next, while first differencing is the difference between one observation and the next. Seasonal and trend differencing can both be applied, but sometimes, if seasonal differencing is performed first, it will remove the need for further differencing [1].

In Figure 4, note how a log transform, seasonal differencing, and second differencing are all necessary to ultimately remove the trend.



Figure 5: Model Fitting with Ordinary Least Squares [2]

If a monotonic trend is observed, such as the one in Figure 5, model fitting can be performed. In this example, a linear model is fit to the trend by choosing coefficients that minimize the sum of squared errors. This model is then subtracted from the original data to give residuals. The goal is for the resulting residuals to be stationary. Note that a polynomial model can also be fit to the trend if appropriate [2].


Heteroscedasticity describes the phenomenon in which the data do not exhibit a constant variance, a violation of the second principle. Heteroscedasticity tends to appear in financial time series (i.e., prices of stocks and bonds), which can be very volatile, but it appears less often in hydrological data [3]. I did not have to address heteroscedasticity in the electricity load data for my project, and some statisticians suggest that one doesn’t have to deal with it unless it is very severe, as weak heteroscedasticity tends to be taken care of with normalization and de-seasonalization.

One way to check for heteroscedasticity in a time series is with the McLeod-Li test for conditional heteroscedasticity. If heteroscedasticity is present, consider using an ARCH or GARCH model (when an AR or ARMA model, respectively, can be fit to the data), or a hybrid ARCH-ARIMA model if those models are not appropriate.

Choosing a Time Series Model

Once the necessary transformations have been performed, you are ready to fit a time series model to your data. R has some useful packages for this: forecast and stats. Some helpful functions in these packages include:

auto.arima (forecast) – This function tells you what model is the best fit for your data, the coefficients for the lag terms, and variance of errors (along with other useful information).

arima.sim (stats) – This function allows you to simulate a set of data from your time series model.

predict (stats) – This function will provide a prediction for n time steps into the future based on the chosen time series model. Keep in mind it is best used to predict just the next few time steps. A rough sketch of how these functions fit together is shown below.
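This sketch assumes the transformed, stationary data are in a vector called y (function names are from the forecast and stats packages; see each package's documentation for the full set of options):

library(forecast)

#Let auto.arima select an ARIMA model for the (transformed) data
fit = auto.arima(y)
summary(fit)

#Simulate a synthetic series from an AR(1) model with a chosen coefficient (stats::arima.sim)
sim = arima.sim(model = list(ar = 0.7), n = length(y))

#Predict the next few time steps from the fitted model
forecasted = predict(fit, n.ahead = 5)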

Finally, remember that back-transformations must be performed on all simulations or predictions to get them back into the original space.

*For a really helpful explanation of different time series notation, check this previous post.


*All information or figures not specifically cited came from class notes and homework from Dr. Scott Steinschneider’s class

(1) Stationarity and Differencing:

(2) Removal of Trend and Seasonality, UC Berkeley:

(3) Heteroscedasticity: