September 29, 2017 by William Raseman

Time Series Modeling: ARIMA Notation

A quick note!

If you are looking for more exhaustive resources on time series modeling, check out Forecasting: Principles and Practice and Penn State 510: Applied Time Series Analysis. These have time series theory plus examples of how to implement it in R. (for a more detailed description of these resources, see the ‘References’ section)

Motivation

Hydrological, meteorological, and ecological observations are often a special type of data: a time series. A time series consists of observations (say streamflow) at equally-spaced intervals over some period of time. Many of us on this blog are interested in running simulation-optimization models which receive time series data as an input. But the time series data from the historical record may be insufficient for our work, we also want to create synthetic time series data to explore a wider range of scenarios. To do so, we need to fit a time series model. If you are uncertain why we would want to generate synthetic data, check out Jon L.’s post “Synethic streamflow generation” for some background. If you are interested in some applications, read up on this 2-part post from Julie.

A common time series model is the autoregressive moving average (ARMA) model. This model has many variations including the autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA) models, and ARIMA models with external covariates (ARIMAX and SARIMAX). This class of models is useful but it has its own special notation which can be hard to unpack. Take the SARIMA model for example:

$\Phi(B^S)\phi(B)(x_t-\mu)W_t=\Theta(B^S)\theta(B)\epsilon_t\qquad\qquad(1)$

Confused yet? Me too. What are those functions? What does the B stand for? To help figure that out, I’m going to break down some time series notation into bite-sized pieces. In this post, I will unpack the ARMA model (eq. 2). If you are interested in understanding (eq. 1) check out Penn State 510: Applied Time Series Analysis – Lessons 4: Seasonal Models.

$\phi(B)(x_t-\mu)=\theta(B)\epsilon_t\qquad\qquad(2)$

Autoregressive (AR) and moving average (MA) models

An ARMA model is generalized form of two different models: the autoregressive (AR) and moving average (MA). Both the AR (eq. 3) and MA (eq. 4) models have a single parameter, p and q, respectively, which represent the order of the model.

$x_t=c+\phi_1x_t_-_1+\phi_2x_t_-_2+...+\phi_px_t_-_p+\epsilon_t\qquad\qquad(3)$

$MA(q): x_t=\mu+\epsilon_t+\theta_1\epsilon_t_-_1+\theta_2\epsilon_t_-_2+...+\theta_q\epsilon_t_-_q \qquad\qquad(4)$

The c and μ are constants, x’s are the time series observations, θ’s and Φ’s are weighting parameters for the different lagged terms, and ε represents a random error term (i.e. it has a normal distribution with mean zero). You can see already how these equations might get a bit tedious to write out. Using what is known as a backshift operator and defining specific polynomials for each model, we can use less ink to get the same point across.

Backshift operator

The backshift (also known as the lag) operator, B, is used to designate different lags on a particular time series observation. By applying the backshift operator to the observation at the current timestep, x_t, it yields the one from the previous timestep x_t-1(also known as lag 1).

$Bx_t=x_t_-_1$

It doesn’t save much ink in this simple example, but with more model terms the backshift operator comes in handy. Using this operator, we can represent any lagged term by raising B to the power of the desired lag. Let’s say we want to represent the lag 2 of x_t.
$B^2x_t=x_t_-_2$

Or possibly the lag 12 term.

$B^1^2x_t=x_t_-_1_2$

Example 1: AR(2) – order two autoregressive model

Let’s apply the backshift operator to the AR(2) model as an example. First, let’s specify the model in our familiar notation.

$AR(2): x_t=c+\phi_1x_t_-_1+\phi_2x_t_-_2+\epsilon_t$

Now, let’s apply the backshift operator.

$AR(2): x_t=c+\phi_1Bx_t+\phi_2B^2x_t+\epsilon_t$

Notice that x_t. shows up a few times in this equation, so let’s rearrange the model and simplify.

$(1-\phi_1B-\phi_2B^2)x_t=c+\epsilon_t$

Once we’ve gotten to this point, we can define a backshift polynomial to further distill this equation down. For order two autoregressive models, this polynomial is defined as

$\phi(B)=1-\phi_1B-\phi_2B^2$

Combine this with the above equation to get the final form of the AR(2) equation.

$\phi(B)x_t=c+\epsilon_t\newline\newline where\ \phi(B)=1-\phi_1B-\phi_2B^2$

Example 2: MA(1) – order one moving average model

Starting to get the hand of it? Now we’re going to apply the same approach to a MA(1) model.

$MA(1): x_t=\mu+\epsilon_t+\theta_1\epsilon_t_-_1$

Now let’s apply the backshift operator.

$x_t=\mu+\epsilon_t+\theta_1B\epsilon_t$

Rearrange and simplify by grouping ε_t terms together.

$x_t=\mu+(1-\theta_1B)\epsilon_t$

Define a backshift polynomial to substitute for the terms in the parentheses.

$\theta(B)=1-\theta_1B$

Substitute polynomial to reach the compact notation.

$x_t=\mu+\theta(B)\epsilon_t \newline\newline where\ \theta(B)=1-\theta_1B$

Autoregressive moving average (ARMA) models

Now that we’ve had some practice with the AR and MA models, we can move onto ARMA models. As the name implies, the ARMA model is simply a hybrid between the AR and MA models. As a shorthand, AR(p) is equivalent to ARMA(p,0) and MA(q) is the same as ARMA(0,q). The full ARMA(p,q) model is as follows:

$ARMA(p,q):x_t=\mu+\epsilon_t+\phi_1x_t_-_1+\phi_2x_t_-_t+...+\phi_px_t_-_p+\theta_1\epsilon_t_-_1+\theta_2\epsilon_t_-_2+...+\theta_q\epsilon_t_-_q$

Example 3: ARMA(2,2)

For the grand finale, let’s take the ARMA model from it’s familiar (but really long) form and put in it more compact notation. As an example we’ll look at the ARMA(1,2) model.

$ARMA(1,2):x_t=\mu+\epsilon_t+\phi_1x_t_-_1+\theta_1\epsilon_t_-_1+\theta_2\epsilon_t_-_2$

First, apply the backshift operator.

$x_t=\mu+\epsilon_t+\phi_1Bx_t+\theta_1B\epsilon_t+\theta_2B^2\epsilon_t$

Rearrange and simplify by grouping the terms from the current timestep, t. (If you are confused by this step check out “Clarifying Notes #2”)

$(1-\phi_1B)(x_t-\mu)=(1+\theta_1B+\theta_2B^2)\epsilon_t$

Substitute the polynomials defined for AR and MA to reach the compact notation.

$\phi(B)(x_t-\mu)=\theta(B)\epsilon_t$

$where\newline\phi(B)=1-\phi_1B\newline\theta(B)=\theta_1B+\theta_2B^2$

And that’s it! Hopefully that clears up ARMA model notation for you.

Clarifying Notes

There are many different conventions for the symbols used in these equations. For example, the backshift operator (B) is also known as the lag operator (L). Furthermore, sometimes the constants used in AR, MA, and ARMA models are omitted with the assumption that they are centered around 0. I’ve decided to use the form which corresponds to agreement between a few sources with which I’m familiar and is consistent with their Wikipedia pages.
What does it mean for a backshift operator to be applied to a constant? For example, like for μ in equation 2. Based on my understanding, a backshift operator has no effect on constants: Bμ = μ. This makes sense because a backshift operator is time-dependent but a constant is not. I don’t know why some of these equations have constants multiplied by the backshift operator but it appears to be the convention. It seems to be more confusing to me at least.
One question you may be asking is “why don’t we just use summation terms to shorten these equations?” For example, why don’t we represent the AR(p) model like this?

$AR(p):x_t=c+\sum_{i=1}^{p}\phi_ix_t_-_i+\epsilon_t$

We can definitely represent these equations with a summation, and for simple models (like the ones we’ve discussed) that might make more sense. However, as these models get more complicated, the backshift operators and polynomials will make things more efficient.

References

Applied Time Series Anlysis, The Pennsylvania State University: https://onlinecourses.science.psu.edu/stat510/

Note: This is a nice resource for anyone looking for a more extensive resource on time series analysis. This blogpost was inspired largely by my own attempt to understand Lessons 1 – 4 and apply it to my own research.

Chatfield, Chris. The Analysis of Time Series: An Introduction. CRC press, 2016.

Hyndman, Rob J., and George Athanasopoulos. Forecasting: Principles and Practice. Accessed October 27, 2017. http://Otexts.org/fpp2/.

Note: This is an AWESOME resource for everything time series. It is a bit more modern than the Penn State course and is nice because it is based around the R package ‘forecast’ and has a companion package ‘fpp2’ for access to data. Since it is written by the author of ‘forecast’ (who has a nice blog and is a consistent contributor to Cross Validated and Stack Overflow), it is consistent in its approach throughout the book which is a nice bonus.

Wikipedia: https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model

September 26, 2017 by charlesrouge

Introduction to Docker

In this post we’ll learn the principles of Docker, and how to use Docker with large quantities of data in input / output.

1. What is Docker?

Docker is a way to build virtual machines from a file called the Docker file. That virtual machine can be built anywhere with the help of that Docker file, which makes Docker a great way to port models and the architecture that is used to run them (e.g., the Cube: yes, the Cube can be ported in that way, with the right Docker file, even though that is not the topic of this post). Building it creates an image (a file), and a container is a running instance of that image, where one can log on and work. By definition, containers are transient and removing does not affect the image.

2. Basic Docker commands

This part assumes that we already have a working Docker file. A docker file runs a series of instructions to build the container we want to work in.

To build a container for the WBM model from a Docker file, let us go to the folder where the Docker file is and enter:

docker build -t myimage -f Dockerfile .

The call docker build means that we want to run a Docker file; -t means that we name, or “tag” our image, here by giving it the name of “myimage”; -f specifies which Docker file we are using, in case there are several in the current folder, and “.” says that we run the Docker file and build the container in the current folder. Options -t and -f are optional in theory, but the tag -t is very important as it gives a name to your built image. If we don’t do that, we’ll have to go through the whole build every time we want to run a Docker container from the Docker file. This would waste a lot of time.

Once the Docker image is built, we can run it. In other words, have a virtual machine running on the computer / cluster / cloud where we are working. To do that, we enter:

docker run -dit myimage

The three options are as follows: -d means that we do not directly enter the container, and instead have it running in the background, while the call returns the containers hexadecimal ID. -i means that we keep the standard input open. Finally, -t is our tag, which is the name of the docker image (here, “myimage”).

We can now check that the image is running by listing all the running images with:

docker ps

In particular, this lists displays a list of hexadecimal IDs associated to each running image. After that, we can enter the container by typing:

 docker exec -i -t hexadecimalID /bin/bash

where -i is the same as before, but -t now refers to the hexadecimal ID of the tagged image (that we retrieved with docker ps). The second argument /bin/bash simply sets the directory of the shell in a standard way.

Once in the container, we can run all the processes we want. Once we are ready to exit the container, we can exit it by typing… exit.

Once outside of the container, we can re-enter it as long as it still runs. If we want it to stop running, we use the following command to “kill” it (not my choice of words!):

 docker kill hexadecimalID

A short cut to calling all these commands in succession is to use the following version of docker run:

 docker run -it myimage /bin/bash

This command logs us onto the image as if we had typed run and exec at the same time (using the shell /bin/bash). Note that option -d is not used in this call. Also note that upon typing exit, we will not only exit the container, but also kill the running Docker image. This means that we don’t have to retrieve its hexadecimalID to log on to the image, nor to kill it.

Even if the container is not running any more, it can be re-started and re-entered by retrieving its hexadecimal ID. The docker ps command only lists running containers, so to list all the containers, including those that are no longer running, we type:

 docker ps -a

We can then restart and re-enter the container with the following commands:


docker restart hexadecimalID

docker exec -it hexadecimalID /bin/bash

Note the absence of options for docker restart. Once we are truly done with a container, it can be removed from the lists of previously running containers by using:

 docker rm hexadecimalID

Note that you can only remove a container that is not running.

3. Working with large input / output data sets.

Building large quantities of data directly into the container when calling docker build has three major drawbacks. First, building the docker image will take much more time because we will need to transfer all that data every time we call docker build. This will waste a lot of time if we are tinkering with the structure of our container and are running the Docker file several times. Second, every container will take up a lot of space on the disk, which can prove problematic if we are not careful and have many containers for the same image (it is so easy to run new containers!). Third, output data will be generated within the container and will need to be copied to another place while still in the container.

An elegant workaround is to “mount” input and output directories to the container, by calling these folders with the -v option as we use the docker run command:

 docker run -it -v path/to/inputs -v path/to/outputs myimage /bin/bash

 docker run -dit -v path/to/inputs -v path/to/outputs myimage

The -v option is abbreviation for “volume”. This way, the inputs and outputs directories (set on the same host as the container) are used directly by the Docker image. If new outputs are produced, they can be added directly to the mounted output directory, and that data will be kept in that directory when exiting / killing the container. It is also worth noting that we don’t need to call -v again if we restart the container after killing it.

A side issue with Docker is how to manage user permissions on the outputs a container produces, but 1) that issue arises whether or not we use the -v option, and 2) this is a tale for another post.

Acknowledgements: thanks to Julie Quinn and Bernardo Trindade from this research group, who started exploring Docker right before me, making it that much easier for me to get started. Thanks also to the Cornell-based IT support of the Aristotle cloud, Bennet Wineholt and Brandon Baker.

September 22, 2017 by Antonia Hadjimichael

Exploring the stability of systems of ordinary differential equations – an example using the Lotka-Volterra system of equations

Stability when dealing with dynamical systems is important
because we generally like the systems we make decisions on to be predictable.
As such, we’d like to know whether a small change in initial conditions could
lead to similar behavior. Do our solutions all tend to the same point? Would
slightly different initial conditions lead to the same or to a completely
different point for our systems.

This blogpost will consider the stability of dynamical systems of the form:

The equilibria of which are denoted by x* and y*,
respectively.

I will use the example of the Lotka-Volterra system of
equations, which is the most widely known method of modelling many predator-prey/parasite-host interactions encountered in natural systems. The Lotka-Volterra predator-prey equations were discovered independently by both Alfred Lotka and Vito Volterra in 1925-26. Volterra got to these equations while trying to explain why, immediately after WWI, the number of predatory fish was much larger than before the war.

The system is described by the following equations:

Where a, b, c, d > 0 are the parameters describing the
growth, death, and predation of the fish.

In the absence of predators, the prey population (x) grows
exponentially with an intrinsic rate of growth b.

Total predation is proportional to the abundance of prey and
the abundance of predators, at a constant predation rate a.

New predator abundance is proportional to the total
predation (axy) at a constant conversion rate c.

In the absence of prey, the predator population decreases at
a mortality rate d.

The system demonstrates an oscillating behavior, as
presented in the following figure for parameters a=1, b=1, c=2, d=1.

Volterra’s explanation for the rise in the numbers of
predatory fish was that fishing reduces the rate of increase of the prey
numbers and thus increases the rate of decrease of the predator. Fishing does
not change the interaction coefficients. So, the number of predators is
decreased by fishing and the number of prey increases as a consequence. Without
any fishing activity (during the war), the number of predators increased which
also led to a decrease in the number of prey fish.

To determine the stability of a system of this form, we
first need to estimate its equilibria, i.e. the values of x and y for which:

An obvious equilibrium exists at x=0 and y=0, which kinda
means that everything’s dead.

We’ll first look at a system that’s still alive, i.e x>0
and y>0:

And

Looking at these expressions for the equilibria we can also
see that the isoclines for zero growth for each of the species are straight
lines given by b/a for the prey and d/ca for the predator, one horizontal and
one vertical in the (x,y) plane.

In dynamical systems, the behavior of the system near an
equilibrium relates to the eigenvalues of the Jacobian (J) of F(x,y) at the equilibrium.
If the eigenvalues all have real parts that are negative, then the equilibrium
is considered to be a stable node; if the eigenvalues all have real parts that
are positive, then the equilibrium is considered to be an unstable node. In the
case of complex eigenvalues, the equilibrium is considered a focus point and
its stability is determined by the sign of the real part of the eigenvalue.

I found the following graphic from scholarpedia to be a
useful illustration of these categorizations.

So we can now evaluate the stability of our equilibria.
First we calculate the Jacobian of our system and then plug in our estimated
equilibrium.

To find the eigenvalues of this matrix we need to find the
values of λ that satisfy: det⁡(J-λI)=0 where I is
the identity matrix and det denotes the determinant.

Our eigenvalues are therefore complex with their real parts equal to 0. The equilibrium is therefore a focus point, right between instability and asymptotic stability. What this means for the points that start out near the equilibrium is that they tend to both converge towards the equilibrium and away from it. The solutions of this system are therefore periodic, oscillating around the equilibrium point, with a period , with no trend either towards the
equilibrium or away from it.

One can arrive at the same conclusion by looking at the
trace (τ) of the Jacobian and its determinant (Δ).

The trace is exactly zero and the determinant is positive
(both d,b>0) which puts the system right in between stability and
instability.

Now let’s look into the equilibrium where x*=0 and y*=0, aka
the total death.

Both b and d are positive real numbers which means that the
eigenvalues will always have real values of different signs. This makes the
(0,0) an unstable saddle point. This is important because if the equilibrium of
total death were a stable point, initial low population levels would tend to
converge towards their extinction. The fact that this equilibrium is unstable
means that the dynamics of the system make it difficult to achieve total death
and that prey and predator populations could be infinitesimally close to zero
and still recover.

Now consider a system where we’ve somehow killed all the
predators (y=0). The prey would continue to grow exponentially with a growth
rate b. This is generally unrealistic for real-life systems because it assumed
infinite resources for the prey. A more realistic model would consider the prey
to exhibit a logistic growth, with a carrying capacity K. The carrying capacity of a biological species is the maximum population size of the species that can be sustained indefinitely given the necessary resources.

The model therefore becomes:

Where a, b, c, d, K > 0.

To check for this system’s stability we have to go through
the same exercise.

The predator equation has remained the same so:

For zero prey growth:

Calculating the eigenvalues becomes a tedious exercise at
this point and the time of writing is 07:35PM on a Friday. I’d rather apply a
small trick instead and use the isoclines to derive the stability of the system. The isocline for the predator zero-growth has remained the same (d/ca), which is a straight line (vertical on the (x,y) vector plane we draw before). The isocline for the prey’s zero-growth has changed to:

Which is again a straight line with a slope of –b/aK, i.e.,
it’s decreasing when moving from left to right (when the prey is increasing). Now looking at the signs in the Jacobian of the first system:

We see no self-dependence for each of the two species (the
two 0), we see that as the predator increases the prey decreases (-) and that
as the prey increases the predator increases too (+).

For our logistic growth the signs in the Jacobian change to:

Because now there’s a negative self-dependence for the prey-as its numbers increase its rate of growth decreases. This makes the trace (τ) of the Jacobian negative and the determinant positive, which implies that our system is now a stable system. Plotting the exact same dynamic system but now including a carrying capacity, we can see how the two populations converge to specific numbers.

September 19, 2017 by David Gold

Let your Makefile make your life easier

This semester I’m taking my first official CS class here at Cornell, CS 5220 Applications of Parallel Computers taught by Dave Bindel (for those of you in the Reed group or at Cornell, I would definitely recommend taking this class, which is offered every other year, once you have some experience coding in C or C++). In addition to the core material we’ve been learning in the class, I’ve been learning a lot by examining the structure and syntax of the code snippets written by the instructor and TA that are offered as starting points and examples for our class assignments. One element in particular that stood out to me from our first assignment was the many function calls made through the makefile. This post will first take a closer look into the typical composition of a makefile and then examine how we can harness the structure of a makefile to help improve workflow on complicated projects.

Dissecting a typical makefile

On the most basic level, a makefile simply consists of series of rules that each have an associated set of actions. Makefiles are how you use the “make” utility, a software package installed on all linux systems. Make has its own syntax similar to bash but with some distinct idiosyncrasies. For example, make allows you to store a snippet of code in whats called a “macro” (these are pretty much analogous to variables in most other languages). A macro to store the flags you would like to run with your compiler could be defined like this:

CFLAGS  = -g -Wall

To referenced the CFLAGS macro, use a dollar sign and brackets, like this:

 $(CFLAGS)

There are a series of “special” predifined macros that can be used in any makefile which are fairly common, you can find them here.

Now that we’ve discussed makefile syntax, lets take a look at how rules are structured within a makefile. A rule specified by a makefile has the following shape:

target: prerequisites
    recipe
    ...
    ...

The target is usually the name of the file that is generated by a program, for example an executable or object file. A prerequisite is the specified input used to create the target (which can often depend on several files). The recipe is the action that make carries out for the intended target (note: this must be indented at every line).

For example, a rule to build an executable called myProg from a c file called myProg.c using the gcc compiler with flags defined in CFLAGS might look like this:

myProg: myProg.c
    gcc $(CFLAGS) $<

Make the makefile do the work

The most common rules within makefiles call the compiler to build code (hence the name “makefile”) and many basic makefiles are used for this sole purpose. However, a rule simply sends a series commands specified by its recipe to the command line and a rule can actually specify any action or series of actions that you want. A ubiquitous example of a rule that specifies an action is “clean”, which may be implemented like this:

clean:
    rm -rf *.o $(PROGRAM)

Where PROGRAM is a macro containing the names of the executable compiled by the makefile.

In this example, the rule’s target is an action rather than an output file. You can call “clean” by simply typing “make clean” into the command line and you will remove all .o files and the executable PROGRAM from your working directory.

Utilizing this application of rules, we can now have our makefile do a lot of our command line work for us. For example, we could create a rule called “run” which submits a series of pbs jobs into a cluster.

run:
    qsub job-$*.pbs

We can then enter “make run” into the command line to execute this rule, which will submit the .pbs jobs for us (note that this will not perform any of the other rules defined in the makefile). Using this rule may be handy if we have a large number of jobs to submit.

Alternatively we could make a rule called “plot” which calls a plotting function from python:

plot: 
    python plotter.py $(PLOTFILES)

Where PLOTFILES is a macro containing the name of files to be plotted and plotter.py is a python function that takes the file names as input.

Those are just two examples (loosely based on a makefile given in CS 5220) of how you can use a makefile to do your command line work for you, but the possibilities are endless!! Ok, maybe that’s getting a bit carried away, but I do find this functionality to be a simple and elegant way to improve the efficiency of your workflow on complex projects.

For some background on makefiles, I’d recommend taking a look a Billy’s post from last year. I also found the GNU make user manual helpful as well as this tutorial from Swarthmore that has some nice example makefiles.

September 12, 2017 by Jazmin Zatarain

Get your research workflow on

I have done some research on research workflow and that includes interviewing some of my peers at Cornell grad school to get a sense of what increases their productivity and what are their strategies for accomplishing long-term research goals. In addition to this, I also gathered good advice from my PI, who is the most ultra-efficient human that I know. Had I taken the following advice, I would’ve written this blog post a week ago.

The Get your research workflow on series, consists of two parts:

Part 1 covers General research workflow tips and Part 2. Setting up your technical workflow for people training in with the Decision Analytics crew (a.k.a. the best crew in town).

General research workflow tips

Disclosure: some of the contents of this list may be easier said than done.

First of all, a research workflow can be very personal and it is definitely tailored to each person’s requirements, personality and interests, but here are some general categories that I think every researcher can relate to:

Taking notes, organizing and reflecting on ideas

I was gifted with the memory of a tuna fish, so I need to take notes for everything. Unfortunately, taking notes with paper notebooks resulted in disaster for me in the past, it was very hard to keep information organized, and occasionally my notebooks would either disappear or get coffee stains all over. Luckily, my office mate Dave, introduced me to the best application for note taking ever: Evernote, this app allows you to keep your notes categorized, so you can keep the information that you need indexed and searchable across every single platform you have, that means that you can keep your notes synchronized with your smartphone, laptop, desktop, etc, and have it accessible anywhere you go.

In addition, the Evernote web clipper tool allows you to save and categorize articles or webpages within your notes and make annotations on them. Additionally, you can tag your notes, this is useful if you have notes that could fit into multiple notebooks. You can also share and invite people to edit notes and you can connect it with Google drive. I would probably still flock to Google docs or Dropbox Paper for collaborative documents, but for personal notes, I prefer the Evernote interface. There’s no limit on the amount of notebooks that you can have. I’ve found this app very useful for brainstorming and developing ideas, I also use it to keep a research log to track my research progress.

Reading journal papers and reference management

Keeping up with the scientific literature can be very challenging specially with the overwhelming amount of journal papers out there, but you can make things manageable for yourself if you find a reference manager that allows you to build a library that makes it easy to find, add, organize, read, prioritize and annotate papers that you can later cite. Additionally, you may want to set up smart notifications about new papers on topics that interest you, and get notified via e-mail. A couple of popular free and open source reference managers that allow you to do the previous are Zotero and Mendeley, also Endnote basic, its free but you would need to upgrade to Endnote Desktop for unlimited storage. These reference managers also allow you to export BibTex files for its integration with LaTeX. You can check out the Grand Reference Management Comparison table for all the reference management software available out there.

In addition to reference manager software, a couple of popular subscription-based multidisciplinary databases are Web of Science and Scopus, they differ from Google scholar, by the fact that these are human curated databases, they are selected by scholarly and quality criteria by literature review committees, and they let you build connections between topics.

Finally, I came across this article on How to keep up with the scientific literature, where a number of scientists were interviewed on the subject, and they all agree that it can be overwhelming but it is key to stay up to date with the literature as its the only way to contextualize your work and identify the knowledge gaps. The article provided advice on how to prioritize what to read despite the overwhelming amount of scientific literature.

Time management and multi-tasking

This is my Achilles heel, and its a skill that requires a lot of practice and discipline. Sometimes, progress in research can seem hard to accomplish, specially when you are involved in several projects, dealing with hard deadlines, taking many classes, TA-ing, or you’re simply busy being a socialité, but there are several tricks to be on top of research while avoiding getting overwhelmed in a multi-tasking world. Some, or most of this tips came from a time-manager master-mind:

Tip # 1. Schedule everything and time everything

Schedule everything from hard, set-in-stone deadlines to casual meetings, that way you’ll know for sure how much time you’ll have to spare on different projects and you can block time for those projects in a weekly basis. Keep track of the time that you spend on different projects/tasks. There’s a very popular app among 3 economists, Julie’s brother and my friend Justyna called be focused that allows you to manage tasks and time them. You can use it to keep track, for instance, of the time it takes you to read a paper, some people use it to time the time it takes them to write a paper till completion, right now I’m tracking the amount of time its taking me to write this blogpost. Timing everything will allow you to get better at predicting the time it will take you to accomplish something and reflect on how you can improve. I always tend to underestimate my timings but this app is giving me a reality check.. very annoying.

Tip # 2. Different mindsets for different time slots

When your schedule is full of small time gaps, fill them doing tasks that involve less concentration, probably reading, answering e-mails, organizing yourself, and leave larger time slots for the most creative and challenging part of your work.

Also, a general recommendation of multi-tasking is don’t do it, trying to do multiple things at once can hurt your productivity, instead, block times to carry specific tasks, were you focus on that one thing, and then move on to the next. Remember to prioritize and tackle the most important thing first.

Tip #4. Visualize long-term research goals and work backwards

Picture what you want to accomplish in a year or in a semester, and work your way backwards, till you refine the accomplishments of each month, each week and each day to hit your long-term target. Setup to-do lists for the week and for the day.

Tip #3. Set aside time for new skills that you want to acquire

Even if you set aside one or two hours a week devoted to that skill that you want to develop, it will pay off, you’ll have come a long way at the end of the year. Challenge yourself and continue to develop new skills.

Tip #5. Don’t leave e-mails sitting in your inbox

There are a couple of strategies for this, you can either allocate time each day specifically for replying to e-mails or you can tackle each e-mail as it comes. If it’s something that will require you more time, move it to a special list, if it’s a meeting, put it in your calendar, if it’s for reference, save it. No matter what your strategy is, take action on an e-mail as soon as you read it.

Collaborative work

Some tools for collaborative work :

Overleaf– for writing LaTeX files collaboratively and visualizing the changes live, the platform has several journal templates and can track changes easily.

Github– platform for collaborative code development and management.

Slack – organize conversations with teams, and organize your collaborative workflow

A final recommendation is to have a consistent and intuitive organization of your research. Document everything, and have reproducible code. If you get hit by a bus and your colleagues are able to continue research were you left off in less than a week, then you’re in good shape, organization-wise.

I hope this helps, let me know if there are some crucial topics that I missed, I can always come back and edit.

Special thanks to all of my grad/postdoc friends that participated in the brief research workflow interview.

Water Programming: A Collaborative Research Blog

Tips and tricks on programming, evolutionary algorithms, and doing research

Month: September 2017