PDFExtract: Get a list of BibTeX references from a scholarly PDF

So you’ve found a review article with a great list of references that you’d like to include in your own paper/thesis/etc. You could look them up, one by one, on Google Scholar and export them in the citation format of your choice. (You could also retype them all by hand, but let’s assume you’re savvy enough to use some kind of citation manager.)

This is not a great use of your time.

Check out PDFExtract, a Ruby library written by folks at CrossRef. Its goal is to read the text of a PDF, identify which parts are “references”, and return that list to the user. It recently gained the ability to return the references in BibTeX format after resolving their DOIs over the web. When the references in the PDF are identified correctly (about 80-90% of the time in my experience), you end up with every reference from that paper to do with as you please: cite them in LaTeX, import them into Zotero, and so on.

How to use it

You will need a recent version of Ruby and its gem package manager. Search around for how to install these on your particular OS. As usual, this will be a lot easier on *nix, but I have it working in Cygwin too, so don’t despair.
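
For example, on a Debian/Ubuntu-style system the setup might look like this (package names vary by OS, so treat this as a sketch; the compiler packages help when gems build native extensions):

# Check whether Ruby and RubyGems are already available
ruby --version
gem --version

# If not, install them (Debian/Ubuntu example; use your OS's package manager otherwise)
sudo apt-get install ruby ruby-dev build-essential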

The latest version of PDFExtract (with BibTeX output) is not on the central gem repository yet, but for now you can build and install from source:

git clone https://github.com/CrossRef/pdfextract
cd pdfextract
gem build pdf-extract.gemspec
gem install pdf-extract-0.1.1.gem  # check version number
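
If you want a quick sanity check that the install worked, you can ask gem and your shell where things ended up:

gem list pdf-extract
which pdf-extract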

You should now have a program called pdf-extract available from the command line. Navigate to a directory with a PDF whose references you’d like to extract, and run the following:

pdf-extract extract-bib --resolved_references MyFile.pdf

It will take a minute to start running, and then it will begin listing the references it finds, along with their resolved DOIs from CrossRef’s web API, like so:

Found DOI from Text: 10.1080/00949659708811825 (Score: 5.590546)
Found DOI from Text: 10.1016/j.ress.2011.10.017 (Score: 4.6864557)
Found DOI from Text: 10.1016/j.ssci.2008.05.005 (Score: 0.5093678)
Found DOI from Text: 10.1201/9780203859759.ch246 (Score: 0.6951939)
Found DOI from Text: 10.1016/s0377-2217(96)00156-7 (Score: 5.2922735)
...

Note that not all resolutions are perfect. The score reflects the degree of confidence that the reference extracted from the PDF matches the indicated DOI. References scoring below 1.0 will not be included in the final output, as they are probably incorrect matches.

Go make yourself a coffee while it searches for the rest of the DOIs. Eventually it will move on to the second phase of the process: using each DOI to obtain a full BibTeX entry from the web API. Again, this is skipped for DOIs with scores below 1.0.

Found BibTeX from DOI: 10.1080/00949659708811825
Found BibTeX from DOI: 10.1016/j.ress.2011.10.017
Found BibTeX from DOI: 10.1016/s0377-2217(96)00156-7
Found BibTeX from DOI: 10.1016/j.ress.2006.04.015
Found BibTeX from DOI: 10.1111/j.1539-6924.2010.01519.x
Found BibTeX from DOI: 10.1002/9780470316788.fmatter
...
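
(If you’re curious about this step, the DOI system supports content negotiation: a plain HTTP request with the right Accept header returns a BibTeX record. Here’s a sketch with curl, using the first DOI above; this illustrates the general mechanism rather than the exact call pdf-extract makes.)

curl -LH "Accept: application/x-bibtex" "http://dx.doi.org/10.1080/00949659708811825"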

Finish your coffee, check your email, and chuckle at the poor saps out there gathering their references by hand. When the program finishes, look for a file called MyFile.bib (the same name as the original PDF, with a .bib extension) in the directory from which you invoked the pdf-extract command. Open it up in a text editor or reference manager and take a look. Here’s the output from my example:

@article{Archer_1997,
doi = {10.1080/00949659708811825},
url = {http://dx.doi.org/10.1080/00949659708811825},
year = 1997,
month = {May},
publisher = {Informa UK Limited},
volume = {58},
number = {2},
pages = {99-120},
author = {G. E. B. Archer and A. Saltelli and I. M. Sobol},
title = {Sensitivity measures,anova-like Techniques and the use of bootstrap},
journal = {Journal of Statistical Computation and Simulation}
}
@article{Auder_2012,
doi = {10.1016/j.ress.2011.10.017},
url = {http://dx.doi.org/10.1016/j.ress.2011.10.017},
year = 2012,
month = {Nov},
publisher = {Elsevier BV},
volume = {107},
pages = {122-131},
author = {Benjamin Auder and Agn\`es De Crecy and Bertrand Iooss and Michel Marqu\`es},
title = {Screening and metamodeling of computer experiments with functional outputs. Application to thermal$\textendash$hydraulic computations},
journal = {Reliability Engineering \& System Safety}
}

... (and many more!)

A few extra-nice things: (1) it includes every DOI, which some journals require and which are pesky to track down by hand, and (2) it attempts to escape all BibTeX special characters by default. Merge this with your existing library and be happy! (You could even use this to recover or build up a reference library from your own papers!)
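
(For what it’s worth, the merge itself can be as crude as appending one file to another and de-duplicating in your reference manager afterwards; my_library.bib below is just a placeholder name.)

cat MyFile.bib >> my_library.bib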

Caveats

  • This works a lot better on journal articles than on longer documents like theses and textbooks. It assumes that the “Reference” section is toward the end, so a chapter-based or footnote-based reference format will cause it to choke.

  • It will not work on non-digital articles—for example, older articles which were scanned and uploaded to a journal archive.

  • Be careful with character encoding when you are importing/exporting BibTeX with other applications (like Zotero), or even when managing the file yourself. You may want to look for settings in all of your applications that let you set the character encoding to UTF-8 (see the quick check sketched after this list).

  • Lots of perfectly good references do not have DOIs and thus will not be resolved by the web API. This includes many government agency reports, for example. In general do not expect to magically BibTeXify things other than journal articles and the occasional textbook.

  • Reading a PDF is tricky business; some journal formats just won’t work. Failures show up as (1) consistently bad DOI resolution scores, (2) an outright error message from the PDF reader (these are very hard to trace), or (3) bizarre entries at the end of your BibTeX file. I’ve accidentally “extracted” references about ornithology, for example; just delete these and move on.
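
About the character-encoding caveat above: on *nix (or in Cygwin) you can quickly check what encoding the generated file ended up in, and convert it if necessary. A sketch using standard tools; the source encoding (latin1) is only an example:

file MyFile.bib      # reports the detected encoding
iconv -f latin1 -t utf-8 MyFile.bib > MyFile-utf8.bib      # convert to UTF-8 if needed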

8 thoughts on “PDFExtract: Get a list of BibTeX references from a scholarly PDF”

  1. Do you have any advice for someone who cannot get the “gem build pdf-extract.gemspec” step to work? I end up with a segmentation fault and the gem file produced is empty. I’m using cygwin, and I’m not familiar with Ruby, so that might be part of the problem. The github page hasn’t been too helpful in this respect.

  2. Hmm, this could be tough to track down. Could you post the error message, or some relevant parts of it? The library on GitHub hasn’t changed in a while, and back when I wrote this post I was using Cygwin too, so it’s at least theoretically possible. Also, what versions of gem and Ruby? (Just run "gem --version" on the command line.)

    • I’m running version 2.3.0 of Gem and Ruby 2.2.2p95. The error that was produced seems like it contains more warnings than anything else. I’m not sure how helpful it is. I tried changing the syntax to what the message recommended, but that was no help.

      $ gem build pdf-extract.gemspec
      WARNING: licenses is empty, but is recommended. Use a license abbreviation from:
      http://opensource.org/licenses/alphabetical
      WARNING: no description specified
      WARNING: open-ended dependency on nokogiri (>= 1.5.0) is not recommended
      if nokogiri is semantically versioned, use:
      add_runtime_dependency 'nokogiri', '~> 1.5', '>= 1.5.0'
      WARNING: open-ended dependency on prawn (>= 0.11.1) is not recommended
      if prawn is semantically versioned, use:
      add_runtime_dependency 'prawn', '~> 0.11', '>= 0.11.1'
      WARNING: open-ended dependency on sqlite3 (>= 1.3.4) is not recommended
      if sqlite3 is semantically versioned, use:
      add_runtime_dependency 'sqlite3', '~> 1.3', '>= 1.3.4'
      WARNING: open-ended dependency on commander (>= 4.0.4) is not recommended
      if commander is semantically versioned, use:
      add_runtime_dependency 'commander', '~> 4.0', '>= 4.0.4'
      WARNING: open-ended dependency on json (>= 1.5.1) is not recommended
      if json is semantically versioned, use:
      add_runtime_dependency 'json', '~> 1.5', '>= 1.5.1'
      WARNING: open-ended dependency on rb-libsvm (>= 1.1.3) is not recommended
      if rb-libsvm is semantically versioned, use:
      add_runtime_dependency 'rb-libsvm', '~> 1.1', '>= 1.1.3'
      WARNING: open-ended dependency on mongo (>= 1.9.2, development) is not recommended
      if mongo is semantically versioned, use:
      add_development_dependency 'mongo', '~> 1.9', '>= 1.9.2'
      WARNING: open-ended dependency on bson_ext (>= 1.9.2, development) is not recommended
      if bson_ext is semantically versioned, use:
      add_development_dependency 'bson_ext', '~> 1.9', '>= 1.9.2'
      WARNING: open-ended dependency on rake (>= 10.1.0, development) is not recommended
      if rake is semantically versioned, use:
      add_development_dependency 'rake', '~> 10.1', '>= 10.1.0'
      WARNING: See http://guides.rubygems.org/specification-reference/ for help
      Segmentation fault (core dumped)

      Thank you for your help!

  3. Pingback: PDF Annotation Related Tools – Emacs, Arduino, Raspberry Pi, Linux and Programming etc

  4. Thanks a lot for your very helpful blog post!

    I am trying to use PDFExtractor for one of my projects but can’t get it to work, producing CommandErrors and LoadErrors.

    I am wondering in what kind of Ruby environment you’re running PDFExtract? (I tried both ruby 2.2.4, which produced the Errors, and 1.9.3, which was incompatible with some “prawn” dependency.)

    Is this just me not setting up PDFExtract in the right way or has it been abandoned?

    Thanks! 🙂
    Basanta

  5. Pingback: Water Programming Blog Guide (3) – Water Programming: A Collaborative Research Blog
