PBS Batch Serial Job Submission with Slot Limit

The situation: You have an “embarrassingly parallel” set of tasks that do not need to communicate with each other. They are currently configured to run in serial, but their combined runtime would exceed the allowed queue time (or your patience).

Goal: Submit a batch of serial jobs to a PBS queue in a sensible way.

Your old job script, my_job.sh:

#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1


for number in ${1..1000}
  ./my_program -n ${number}

As usual, you would submit this with qsub my_job.sh.

New idea:
Most PBS systems support job arrays with the -t flag:

qsub my_job.sh -t 0-1000

The index of each job is passed into my_job.sh via the $PBS_ARRAYID environment variable, so that the job script can be revised:

#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1


./my_program -n $PBS_ARRAYID

Now you should have 1000 jobs running. They are “parallel” in the sense that they will run simultaneously, but are not using MPI. The jobs will appear as an array to save you the hassle of scrolling through all of them:

jdh366@thecube ~/demos-meeting/serial-batch
$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
117046[].thecube           my-job-script.sh jdh366                 0 R default  

The stdout from each job will be saved in files named my-job-script.sh.o117046-<i> where <i> is the index of the job.

What if I have more jobs than processors?
You have two choices. First, you could submit all of them at the same time. PBS will run as many jobs as it can until they are finished. However, this is likely to occupy the entire system with many short serial jobs, which people won’t like.

The second option is to specify a “slot limit” using the % character in the qsub command like so:

qsub my_job.sh -t 0-1000%50

This command will run a maximum of 50 serial jobs at a time. When these finish, it will run another 50 and so on until the 1000 are complete. This avoids using the entire system at once.

For more HPC submission script examples, see here:

Setting Borg parameters from the Matlab wrapper

In previous posts we talked about how to compile and run the Borg Matlab wrapper on Windows and OSX/Linux. To run the optimization after everything is compiled, the function call looks like this:

[vars, objs] = borg(11, 2, 0, @DTLZ2, 1000, zeros(1,11), ones(1,11), 0.01*ones(1,2), parameters);

In order, the arguments are:

- number of decision variables
- number of objectives
- number of constraints
- handle to function to be optimized
- number of function evaluations to perform
- decision lower bounds
- decision upper bounds
- epsilon values
- parameters

The last argument parameters is the topic of this post. This is a cell array that allows you to set the parameters without recompiling anything. For example, to set the random seed, you might do this:

parameters = {'rngstate', 12345, 'initialPopulationSize', 200, 'de.stepSize', 0.3}

before running the borg() function with parameters as the last argument. Notice this follows the Matlab cell array options format of {key, value, key, value ...}, so there must be an even number of items in the cell array.

Using the nativeborg.cpp file I tracked down all of the parameters (string keys) that the user can choose:

'rngstate': random seed
'frequency': (for saving approximation sets during runtime)

// these two are enumerated types in the C code, not advised to change them

// operator-specific parameters

That’s all, thanks to Liz Houle at Boulder for the suggestion.

latexdiff: “track changes” for LaTeX

If you’re looking to display changes to a LaTeX document in different colors, say for a paper or thesis revision, you might be interested in latexdiff. This is a command-line tool that most likely came packaged with your LaTeX distribution—try running latexdiff --version in your terminal to see if you have it.

The idea is simple, just run:

latexdiff original.tex revised.tex > diff.tex

Then compile diff.tex into a PDF just like you would any LaTeX file. The result should look something like this:

This is a great tool! It isn’t completely foolproof, and you will sometimes encounter errors when compiling diff.tex, especially if you’ve changed citations between the original and revised versions. These can be resolved manually by editing diff.tex, just keep track of the line numbers where the errors are occurring.

A more detailed summary of options for latexdiff is available here:

Thanks for reading!

Python – Extract raster data value at a point

Often when working with raster data, you’ll want to find the value of a variable at a particular point on the globe rather than plotting the entire map. This is especially true in water management applications, where the point of interest might be a city or watershed outlet.

Here is a minimum working example of how to do that. I’m using GDAL to work with the raster files (the prior post describes GDAL in more detail), along with geopy to perform lat/lon lookups for cities. (Geopy is not required for this, it just makes it much easier to find coordinates).

from __future__ import division
from osgeo import gdal
from geopy.geocoders import Nominatim
def get_value_at_point(rasterfile, pos):
gdata = gdal.Open(rasterfile)
gt = gdata.GetGeoTransform()
data = gdata.ReadAsArray().astype(np.float)
gdata = None
x = int((pos[0] - gt[0])/gt[1])
y = int((pos[1] - gt[3])/gt[5])
return data[y, x]
city = 'Ithaca, NY'
g = Nominatim().geocode(city, timeout=5)
p = (g.longitude,g.latitude)
print get_value_at_point('path/to/my/raster/file', p)
view raw raster-point.py hosted with ❤ by GitHub

The get_value_at_point function does all of the work. Note that the indices x and y inside the function are being converted to int, which means there may be some loss of accuracy (i.e. it will round to the nearest pixel). This is most likely not an issue, but is worth remembering. You could easily build on this function to explore values at a single point across multiple raster files (for example, comparing climate projections). Hope this helps!

Python – Clip raster data with a shapefile

The problem: you have some georeferenced raster data, but you’re only interested in the points that lie inside of a shapefile boundary. This could be a country, state, watershed, etc. — it doesn’t matter, as long as you have a shapefile to define the boundary.

(Aside: georeferenced raster data can come in a number of formats, some are listed here.)

The solution: we’re going to clip the raster data to remove points outside the shapefile boundary, and then plot it with Python.

The ingredients: If you’re following along, we’ll be using some CMIP5 precipitation projections at 5-minute spatial resolution, which you can find here. (In particular, the 2070 projections from the GFDL-CM3 model under the rcp45 emissions scenario, but any of them will do). You’ll also need a shapefile, so if you don’t have one handy you can grab the US states shapefile.

(If you’re interested in watershed boundary shapefiles, here are the latest HUC classifications from USGS. If you grab the WBD_National.zip file, it contains all of HUC{2,4,6,8,10,12} shapefiles. Be warned, it is large, and you may be able to find your specific watershed elsewhere.)

If we plot the raster data—in this case, projected average August precip in millimeters—on top of the shapefile, we’ll get something like this, where the data covers all of North America:


I’ll put an example of making this plot with Python+Basemap at the end of the post.

To clip the raster with our US shapefile, we’ll be using GDAL, a collection of tools for manipulating spatial data. On Windows, the GDAL binaries and Python wrapper libraries all come bundled with the Python(x,y) installation, which is what I’m using. On *nix, pip install gdal should achieve the same thing, but you may need to fiddle around with base packages and installation settings.

Anyway, once you have the GDAL binaries, go to your command line and run this:

gdalwarp -cutline path/to/states.shp \
-crop_to_cutline -dstnodata "-9999.0" \
my_raster_file.tif \

The -dstnodata option specifies the “nodata” value to use in the destination file, where points outside the shapefile will be masked. If all goes well, you should now have my_raster_file_usa.tif which you can plot:


Which is the same data, of course, but clipped! You should be able to do this with any state, county, or watershed you want, as long as the resolution of the raster isn’t coarser than the area of the shapefile (in which case you would only be clipping one pixel).

As promised, here’s how to make these plots in Python. The line cmap.set_under('1.0') is deceptively important, because it sets all the “nodata” values in the file to a white background (remember, we used -9999.0 as the nodata value). Thanks for reading!

Update 3/14/17: Ben asked a good question about how this approach will treat pixels along the edge of the shapefile. Some visual investigation suggests that it (1) removes all pixels that extend beyond the border, and (2) adds a “border” of zero-valued pixels along the edge. It does not seem to aggregate the fractional pixels.

[gist jdherman/7434f7431d1cc0350dbe]

Introduction to mpi4py

If you like parallel computing, and you like Python, chances are you ‘ll like mpi4py. It’s a Python package that provides MPI bindings, which is very helpful if you’re trying to pass arrays or dictionaries between processors without worrying about the details as you would in a lower level language. Installation can be tricky, but we already have mpi4py installed on the Cube cluster for those of you who have accounts.

Getting started is easy:

from mpi4py import MPI
print "Hello from rank %d out of %d !" % (comm.rank, comm.size)
comm.Barrier() # wait to sync here (not needed for this example)

Then in your submission script:

#PBS -l walltime=1:00:00
#PBS -l nodes=4:ppn=16
#PBS -j oe

mpirun python myfile.py

And the output:

Hello from rank 12 out of 64 !
Hello from rank 14 out of 64 !
Hello from rank 13 out of 64 !
Hello from rank 51 out of 64 !
Hello from rank 52 out of 64 !

Let’s try something more interesting. In this example we’ll do a set of parallel runs of a very simple linear rainfall-runoff model (one bucket, where dS/dt = -kS). The only free parameter is k, so we’ll sample a range of values between (0,1) based on the rank of each node.

(Aside: this will be an example of a model built on top of stockflow, which is a Python wrapper for solving system dynamics ODEs. You can read more about the linear reservoir example in this notebook.)


from __future__ import division
from stockflow import simulation
import numpy as np
from mpi4py import MPI


Set up the model, with only one state variable (or “stock”):

# Model setup - linear reservoir
tmin = 0
tmax = 365
dt = 1
t = np.arange(tmin,tmax,dt)

data = np.loadtxt('leaf-river-data.txt', skiprows=2)
data_P = data[tmin:tmax,0]
data_PET = data[tmin:tmax,1]
data_Q = data[tmin:tmax,2]

s = simulation(t)
s.stocks({'S': 0})

Based on the processor’s rank, assign a value of the parameter k:

k = (comm.rank+1)/comm.size

Define flows, run the model, and calculate RMSE:

# Flows: precip, ET, and streamflow
s.flow('P', start=None, end='S', f=lambda t: data_P[t])
s.flow('ET', start='S', end=None, f=lambda t: min(data_PET[t], s.S))
s.flow('Q', start='S', end=None, f=lambda t: k*s.S)

RMSE = np.sqrt(np.mean((s.Q-data_Q)**2))

At this point we have completed model results sitting on every processor, but we’d like to collect them all on the root node to do some analysis, for example to find the best RMSE value. We can do this with MPI’s gather operation. There is a great beginner tutorial here describing the basic MPI operations.

Qs = comm.gather(s.Q, root=0)
RMSEs = comm.gather(RMSE, root=0)
ks = comm.gather(k, root=0)

We’ve now collected all of the results onto the root node (0). Let’s find the best and worst values of k and print them.

if comm.rank==0:
  best = np.argmin(RMSEs)
  worst = np.argmax(RMSEs)
  print "best k = %f" % ks[best]
  print "best rmse = %f" % RMSEs[best]
  print "worst k = %f" % ks[worst]
  print "worst rmse = %f" % RMSEs[worst]

Pretty simple! If we run this with the following job script:

#PBS -l walltime=1:00:00
#PBS -l nodes=16:ppn=16
#PBS -j oe


mpirun python linear_reservoir.py

… it runs on 256 processors almost instantly, and we see this output:

best k = 0.183594
best rmse = 2.439382
worst k = 1.000000
worst rmse = 4.470532

Then we can come back locally and plot the best and worst hydrographs to see what they look like:


This looks a little messy, but you can see that the “best” model run matches the observed data much more closely than the “worst”, which overshoots the peaks considerably.

That’s all for now. Happy HPCing!

Scientific figures in Illustrator

Once you have a figure sequence almost ready to publish in a journal, it’s time to make your figures look good. This may sound like a vain exercise, but if you consider that the journal article will be around for a long time, it’s worth it. Publication-quality scientific figures are typically in vector format (SVG, PDF, or EPS files) rather than raster format (JPEG, PNG, TIFF, and others) which may become pixelated if resized. You can read this discussion of raster vs. vector formats if you’re interested, or just take my word for it that vector is what you want.

Programs capable of editing vector graphics include Adobe Illustrator (paid) and Inkscape (free). This post will be about Illustrator, but keep in mind that Inkscape offers much of the same functionality. You can not edit vector images with photo editing programs like Photoshop or GIMP, nor can you make publishable vector images with MS Office programs like Excel or Powerpoint. This post will take a figure created in Matlab and clean it up into something that will look nice in a publication using Illustrator.

If you run the following Matlab code …

x = 0:0.1:5;

y(1,:) = exp(-1*x);
y(2,:) = exp(-0.5*x);
y(3,:) = exp(-1.5*x);

h = plot(x,y);
legend(h, {'Thing 1', 'Thing 2', 'Thing 3'});
grid on;
xlabel('Time (hours)');
ylabel('Amount of thing');
title('My plot title');

… it will give you this figure:

default matlab figure style

Default figure style. Bleh.

You’ve seen this before: the text is too small, the grid lines are noisy, the colors are uninspiring, the plot lines are too thin … and so on. Some of those issues you can fix from inside Matlab, which is a good idea if you plan to regenerate your plot several times. (An example of doing this is available here in the Matlab Plot Gallery). But here we’re going to do all the work in Illustrator instead (which as you may have gathered is not a very “repeatable” process, so only do it once you agree with coauthors on the general layout of the figure). If you’re following along at home, save your Matlab figure in .eps format. Unfortunately Matlab does not have an option to save in SVG format as of this post, which is typically the option of choice in Python or R.

Go find the .eps figure you just created, and open it with Illustrator. If your file extensions aren’t associated with Illustrator, you may need to right-click and select Open With / Adobe Illustrator. When opening the file, you may see some warnings about missing fonts, just click OK and proceed. Note that we are not “inserting” the file into an existing Illustrator document—we are opening the file itself. You should be greeted by our lovely figure, inside of what may be an overwhelming interface the first time you see it. Let’s break down the toolbar on the left-hand side to start with, and we’ll cover other options as we go.

Illustrator left toolbar

These annotations were professionally done.

This is only the top of the toolbar, but you’ll be using these “tools” (cursor types) probably 95% of the time. The select cursor is the default, and when in doubt you should return to it. The text, line, and shape options are for adding new things to your drawing, which is not so different from, say, Powerpoint. Stay on “Select” for now. Let’s get cleaning.

Step 0: Ungroup the main parts of the figure

When Matlab exports EPS files, it tends to group different parts of the figure together. This can be somewhat unpredictable and makes editing difficult. You’ll notice if you left-click any part of the figure, it will select everything. So, one of the first things you’ll want to do when editing almost any figure is to “ungroup”, which you can do by right-clicking any part of the figure and selecting the ungroup option:

Illustrator ungrouping

How to ungroup the elements of the figure (everything is grouped by default).

Now you should be able to select individual parts of the figure. Matlab may have saved a big white rectangle in the background of your figure (you can see it selected around the border of the above image), which you should feel free to delete. Illustrator’s white “Artboard” will serve as our background, and you can edit the artboard size by pressing Shift+O.

Step 1: Replace all text

This may sound harsh, but hear me out. What we really want to do is just resize the text, or maybe change the font type, nothing major. But what usually happens here is that the EPS file splits text objects into separate boxes, especially when a decimal point is involved. Which means if you resize, you get the following:

Illustrator text resize

The text elements are split into two, so resizing won’t work. Need to start from scratch.

This is not an indictment of Matlab specifically—in fact, when Matplotlib (Python) exports SVGs, it usually saves text as vector paths, which aren’t fonts at all! Rather than trying to repair this strangeness, just replace all of the text yourself. Choose a clean sans-serif font, unless you want to use Computer Modern serif for equations. I like Gill Sans for figures (Gill Sans MT on Windows), but your mileage may vary. My rule of thumb for text sizing is: Title > Axis labels = Legend > Tick labels, something like this:

Illustrator fonts text sizes

Same figure with fonts and text sizes fixed.

That’s already so much better, just by improving the text readability. It looks like we actually spent some time on it, instead of just copypasting the raw Matlab output. You can insert text using the text cursor (shown on the toolbar figure above), and edit its properties from the Window/Type/Character window, pictured below, which you’ll usually want to keep visible.

Illustrator character window

The Type/Character window is where you change font types, sizes, and tracking.

A quick digression: the other window you’ll usually want to keep open is the Align window, which looks like this:

Illustrator align window

The align window is a must for lining up multiple elements. Don’t try to do it by hand.

Those little button symbols are actually pretty intuitive. If you select two or more objects, then press one of these “alignment” buttons, it will either left-, center-, or right-align them horizontally (the left three buttons) or vertically (the right three buttons). This comes in handy when you’re trying to line up tick labels that you’ve edited manually, or combining multiple EPS files as subplots in a larger figure file and you want to make sure they’re aligned properly. Even if you don’t learn how to do this right now, just remember that alignment shouldn’t be done manually.

Ok, digression over. When resizing text, pay particular attention to the aspect ratio (the width:height ratio of the object). Never ever resize these out of proportion. Either use the text size in the character window, shown above (recommended), or Shift+Drag when you resize, which will preserve the aspect ratio.

Don't change the aspect ratio

If you take one thing away from this post …

Step 2: Make grid lines solid and lighter

Right now, the grid lines are noisy and distracting—let’s clean them up. Fortunately the grid lines should all be grouped together, so we don’t need to do any fancy selection to get them all selected at the same time. Just left-click on any one of the grid lines and you should see them all highlighted. Then, on the right-hand toolbar, go to the “Stroke” options and deselect the “Dashed Line” checkbox:

Illustrator line weight

Changing line weight in Illustrator.

(I’m not doing the snapshots in parallel with the actual saved figure versions, so the font formatting isn’t shown in this snapshot). This should turn the grid lines into solid lines. But wait, they’re solid black! This is even more distracting than before! Have no fear, just go to the “Color” options on the right-hand toolbar and knock it down to maybe 30% gray or so:

Illustrator stroke color

Changing stroke color in Illustrator.

While we’re at it, let’s also bump up the thickness of the plot lines. To select the plot lines, you will need to Ctrl+click, since they are grouped (somehow, bizarrely) with the plot itself. You can Ctrl+Shift+click to select the other lines after that, so that you have all three selected at once. Then go back to the “Stroke” options on the right-hand toolbar (pictured in one of the above images) and increase the line weight to 2px. You should now have a figure that looks something like this:

Example plot after some improvements

We’re getting there … thicker lines, thinner grid, and nicer fonts. Only a few steps to go.

This is already a terrific improvement over the original version. I would be fine with publishing it at this point. But, there are a few other things we can do to make it even better.

Step 3: Care about Colors

As we saw in the very first image, Matlab’s default colors are red, green, and blue. This is fine, but you’ll find if you move beyond the defaults, people appreciate the extra effort. Illustrator makes the colors a little less grating when it imports the EPS file, because it converts them to CMYK format and does some magic behind the scenes that I don’t fully understand. But let’s imagine the case where the three lines represent something changing along a continuum (which they are—the exponential decay coefficient, to be exact). In this case, we might want their colors to also lie along a continuous spectrum.

I know of no better place to find discrete color palettes than Colorbrewer (http://colorbrewer2.org/). I am certainly not the first person to sing its praises, so you can go read about it elsewhere. Basically you choose whether your data are “Sequential”, “Diverging”, or “Qualitative”, give it the number of colors you want, and it generates a well-spaced set of colors for you to use. The case I just described would fall into the “Sequential” category, so let’s grab a set of 3 colors (for our three lines) from the single hue blues:

Colorbrewer palettes

Colorbrewer: if you’re not using it, you should be.

You can see in the bottom-right the RGB values for these three colors. You can change this to CMYK or HEX format if you prefer, but I’ll stick to RGBs in this example. Note if I were more responsible, I would have configured the colors of these lines in the Matlab script, so I could regenerate the figure as many times as I needed to. Oh well, we’re doing it in Illustrator now.

We want to change the color of each line, along with its legend entry, to correspond to the three colors that Colorbrewer gave us above. To do this quicker, we’ll use a neat selection trick. First Ctrl+click to select one of the lines. Then go to the menu option Select / Same / Stroke Color to add the line’s legend entry to your selection (see below). Now you can change the color of both of them at the same time!

Illustrator select same stroke color.

Illustrator has an option to select all other elements in the figure that have the same (appearance). In this case we want the same stroke color.

How do we set their color, you ask? Go back to the “Color” options on the right-hand toolbar. The color format might be set to something other than RGB, like CMYK. To change it to RGB, click the small options box in the upper-right corner and select “RGB”:

Illustrator RGB color format

Change the color format to RGB.

Then enter the RGB values from Colorbrewer. Repeat for the other two plot lines, and you’ll end up with something like this:

Illustrator color palette

Same figure as before, but with a “sequential” color palette. Useful if the lines represent some variable changing along a continuum.

Step 4: “Flat” Legend and Outer Box

At this point, I only have a few remaining grievances about this figure. First, the legend is too small. Second, there is a ton of whitespace in the upper right of the figure that is being wasted. Third, the solid black borders around the legend box, and around the plot itself, are distracting attention from the plot lines (which should be the focus, after all). Let’s see what we can do about this.

Select the legend and its components. Since we ungrouped everything at the beginning of this exercise, this will be annoying. Remember you can Shift+Click to select multiple elements at once. To make it easier to edit the legend, just move it outside the plot area for now. (Once you’re more experienced with Illustrator you can use layers to more easily edit “stuff” that’s sitting on top of other “stuff”).

To make the legend bigger (proportionally), it’s fine to select everything and Shift+Drag one of the corners of the bounding box. This will rescale the text proportionally as well. Make sure after you do this that the line thickness in the legend matches the thickness of the plot lines themselves (an important technicality), using the “Stroke” options in the right-hand toolbar.

Things get a bit confusing here because there may actually be two rectangles behind our legend—one defining the black border, and one defining the white background. If this seems to be the case in your figure (it is in mine), delete the black border completely, and then select the white background. Go to the “Color” options on the right-hand toolbar and change from white to a very light gray, maybe around 5% or so:

Illustrator change rectangle fill color

Changing the fill color of a rectangle.

See those two squares in the top-left of the Color options box? Those are the fill and stroke, respectively. You can toggle back and forth between the two by clicking them. The icon with the red line through it means there is “no fill” (or “no stroke”, respectively). The options for “none”, “black”, and “white” are always visible at the bottom because these are so common. Otherwise, you choose your own color. If you wanted something besides grayscale here, you could change the color format like we did before.

Take your enlarged legend and move it back inside the plot box. Here’s mine:

Illustrator flat legend

The “flat” legend at work. It solves our whitespace problem and is also nicer to look at.

Some may call it a fad, but I absolutely love these “flat” legends. Borders around legends are distracting, whereas a borderless light rectangle looks so clean. It can sit on top of the plot grid lines with no problem at all. And, by including the title inside the legend box, we’ve saved a bunch of whitespace at the top of the figure, too. I also rearranged the legend entries to go in the same vertical order as the plot lines—not required, but I think it makes sense here, especially if you pretend they’re called something besides “Thing 2”.

As a final step in our crusade against dark borders, let’s change the outer box to the same line style as the grid lines. For the record, I’m not completely sure I like the outcome of this, but it’s at least worth discussing. Select one of the black tick marks with Ctrl+Click (they are grouped with something else) and then use the Select / Same / Stroke Color to select all of the remaining black lines in the plot. See if you can remember how to change their line width to 0.333px (matching the grid lines), and their color to grayscale 30%. You should end up with something like this:

Example plot after fixing everything

The final result! Send it off to Nature and take a well-deserved break.

The border of the plot doesn’t announce itself anymore, but I consider this a good thing. The focus of the plot should be on the values being plotted, not on the box and grid lines. As I said before, this may be overkill, but now this seems like a polished, easily interpretable figure that I would be happy to see in a journal.

By the way, if you save your figure as a PDF and want to include it in a LaTeX document, it’s easy:

caption{My Caption.}

That’s all for this tutorial. Go back and look at the original plot at the top of this post, and then at the final product we just made—what a difference! Don’t despair if it takes you a while to do this, because eventually you’ll learn where all of the options are and you’ll get much faster. Vector graphics are definitely worth learning, partly for clear scientific communication, but also because they just plain look nice.

PDFExtract: Get a list of BibTeX references from a scholarly PDF

So you’ve found a review article with a great list of references that you’d like to include in your own paper/thesis/etc. You could look them up, one-by-one, on Google Scholar, and export the citation format of your choice. (You could also retype them all by hand, but let’s assume you’re savvy enough to use some kind of citation manager).

This is not a great use of your time.

Check out PDFExtract, a Ruby library written by folks at CrossRef. Its goal is to read text from a PDF, identify which sections are “references”, and return this list to the user. As of recently, it has the ability to return a list of references in BibTeX format after resolving the DOIs over the web. When the references in the PDF are identified correctly (about 80-90% of the time in my experience), you’ll now have all the references from that paper to do with as you please—to cite in LaTeX, or import to Zotero, etc.

How to use it

You will need a recent version of Ruby and its gem package manager. Search around for how to do this on your particular OS. As usual, this will be a lot easier on *nix, but I have it working in Cygwin too so don’t despair.

The latest version of PDFExtract (with BibTeX output) is not on the central gem repository yet, but for now you can build and install from source:

git clone https://github.com/CrossRef/pdfextract
cd pdfextract
gem build pdf-extract.gemspec
gem install pdf-extract-0.1.1.gem  # check version number

You should now have a program called pdf-extract available from the command line. Navigate to a directory with a PDF whose references you’d like to extract, and run the following:

pdf-extract extract-bib --resolved_references MyFile.pdf

It will take a minute to start running, and then it will begin listing the references it finds, along with their resolved DOIs from CrossRef’s web API, like so:

Found DOI from Text: 10.1080/00949659708811825 (Score: 5.590546)
Found DOI from Text: 10.1016/j.ress.2011.10.017 (Score: 4.6864557)
Found DOI from Text: 10.1016/j.ssci.2008.05.005 (Score: 0.5093678)
Found DOI from Text: 10.1201/9780203859759.ch246 (Score: 0.6951939)
Found DOI from Text: 10.1016/s0377-2217(96)00156-7 (Score: 5.2922735)

Note that not all resolutions are perfect. The score reflects the degree of confidence that the reference extracted from the PDF matches the indicated DOI. Scores below 1.0 will not be included in the final output, as they are probably incorrect.

Go make yourself a coffee while it searches for the rest of the DOIs. Eventually it will move to the second phase of this process, which is to use the DOI to obtain a full BibTeX entry from the web API. Again, this will not be done for DOIs with scores below 1.0.

Found BibTeX from DOI: 10.1080/00949659708811825
Found BibTeX from DOI: 10.1016/j.ress.2011.10.017
Found BibTeX from DOI: 10.1016/s0377-2217(96)00156-7
Found BibTeX from DOI: 10.1016/j.ress.2006.04.015
Found BibTeX from DOI: 10.1111/j.1539-6924.2010.01519.x
Found BibTeX from DOI: 10.1002/9780470316788.fmatter

Finish your coffee, check your email, and chuckle at the poor saps out there gathering their references by hand. When the program finishes, look for a file called MyFile.bib—the same filename as the original PDF—in the same directory from which you invoked the pdf-extract command. Open it up in a text editor or reference manager and take a look. Here’s the output from my example:

doi = {10.1080/00949659708811825},
url = {http://dx.doi.org/10.1080/00949659708811825},
year = 1997,
month = {May},
publisher = {Informa UK Limited},
volume = {58},
number = {2},
pages = {99-120},
author = {G. E. B. Archer and A. Saltelli and I. M. Sobol},
title = {Sensitivity measures,anova-like Techniques and the use of bootstrap},
journal = {Journal of Statistical Computation and Simulation}
doi = {10.1016/j.ress.2011.10.017},
url = {http://dx.doi.org/10.1016/j.ress.2011.10.017},
year = 2012,
month = {Nov},
publisher = {Elsevier BV},
volume = {107},
pages = {122-131},
author = {Benjamin Auder and Agn\`es De Crecy and Bertrand Iooss and Michel Marqu\`es},
title = {Screening and metamodeling of computer experiments with functional outputs. Application to thermal$\textendash$hydraulic computations},
journal = {Reliability Engineering \& System Safety}

... (and many more!)

A few extra-nice things: (1) it includes all DOIs, which journals sometimes require and are pesky to track down, and (2) it attempts to escape all BibTeX special characters by default. Merge this with your existing library, and be happy! (You could even use this to recover or develop a reference library from your own papers!)


  • This works a lot better on journal articles than on longer documents like theses and textbooks. It assumes that the “Reference” section is toward the end, so a chapter-based or footnote-based reference format will cause it to choke.

  • It will not work on non-digital articles—for example, older articles which were scanned and uploaded to a journal archive.

  • Careful with character encoding when you are importing/exporting BibTeX with other applications (like Zotero), or even managing the file yourself. You may want to look for settings in all of your applications that allow you to change the character encoding to UTF-8.

  • Lots of perfectly good references do not have DOIs and thus will not be resolved by the web API. This includes many government agency reports, for example. In general do not expect to magically BibTeXify things other than journal articles and the occasional textbook.

  • Reading a PDF is tricky business—there are some journal formats that just won’t work. You will notice failures based on (1) consistently bad DOI resolution scores, (2) complete failure with an error message from the PDF reader (very hard to trace these), or (3) if your BibTeX file contains bizarre entries at the end. I’ve accidentally “extracted” references about ornithology, for example—just delete these and move on.

RSS Feeds for Water Resources Journals

If you want to keep up with new journal publications, but don’t want to receive dozens of robo-email alerts every week, this post is for you. Go find an RSS reader of your choice and set up an account (some popular choices are The Old Reader, Feedly, and NewsBlur). Personally I use The Old Reader, but your mileage may vary. Feedly has a nice mobile app if that’s a requirement for you.


Once you set up your account, you can add subscriptions using the following list of links. These are sometimes hard to find depending on the publisher. If you want to add a journal that’s not listed here, searching Google for “<journal name> rss feed” is usually your best bet. Happy productivity!

NumPy vectorized correlation coefficient

Imagine that you have a matrix X with dimensions (N x k) and a vector y of size (1 x k), and you want to calculate the linear correlation coefficient between y and each row of X to obtain a vector of correlation coefficients, r, of size (N x 1) .

The first way you might consider doing this is in a loop with the np.corrcoef function, which returns the linear correlation coefficient between two vectors:

r = np.zeros((N,1))
for i in xrange(0,N):
    r[k] = np.corrcoef(X[i,:], y)[0,1]

If N is large, or if you need to perform this calculation many times (with an outer loop wrapped around it), this will be very slow. We can time it as follows:

for t in xrange(0,10): # average 10 timings

    start = time.time()
    r = np.zeros((10000,1))
    for k in xrange(0,10000):
        r[k] = np.corrcoef(vic_runoff[k,:], obs_runoff)[0,1]

    end = time.time()
    times[t] = end-start

print np.mean(times)

In this example with N = 10000, the average time is 0.48 seconds per calculation of the r vector. Too slow!

How can we speed this up? Some NumPy functions, such as sum and mean, have an axis argument to perform the operation along a particular dimension, but corrcoef does not (presumably because people often want the correlations between the rows of the matrix X, rather than the correlation of all rows with respect to a single vector). Recall the formula for the sample correlation coefficient between 1D vectors x and y:

linear correlation coefficient (from Wikipedia)

linear correlation coefficient (from Wikipedia)

We can write a function using NumPy’s vectorized arithmetic to compute these values all at once rather than in a loop. For example, np.multiply(X,y) (also given by X*y) performs element-wise multiplication of the vector y over all rows of the matrix X. The function might look something like this:

def vcorrcoef(X,y):
    Xm = np.reshape(np.mean(X,axis=1),(X.shape[0],1))
    ym = np.mean(y)
    r_num = np.sum((X-Xm)*(y-ym),axis=1)
    r_den = np.sqrt(np.sum((X-Xm)**2,axis=1)*np.sum((y-ym)**2))
    r = r_num/r_den
    return r

If we try timing it as before:

for t in xrange(0,10): # average 10 timings

    start = time.time()
    r = vcorrcoef(vic_runoff,obs_runoff)
    end = time.time()
    times[t] = end-start

print np.mean(times)

The average time is only 0.0017 seconds—a reduction of about 99.6%, or a 280x speedup compared to the non-vectorized approach. Not bad! It’s often said that Python is slow, but the matrix operations in NumPy can clearly be used to our advantage.