PDFExtract: Get a list of BibTeX references from a scholarly PDF

So you’ve found a review article with a great list of references that you’d like to include in your own paper, thesis, etc. You could look them up one by one on Google Scholar and export each in the citation format of your choice. (You could also retype them all by hand, but let’s assume you’re savvy enough to use some kind of citation manager.)

This is not a great use of your time.

Check out PDFExtract, a Ruby library written by folks at CrossRef. Its goal is to read text from a PDF, identify which sections are “references”, and return this list to the user. It recently gained the ability to return the references in BibTeX format after resolving their DOIs over the web. When the references in the PDF are identified correctly (about 80-90% of the time in my experience), you’ll have all the references from that paper to do with as you please: cite them in LaTeX, import them to Zotero, etc.

How to use it

You will need a recent version of Ruby and its gem package manager. Search around for how to install these on your particular OS. As usual, this will be a lot easier on *nix, but I have it working in Cygwin too, so don’t despair.

The latest version of PDFExtract (with BibTeX output) is not on the central gem repository yet, but for now you can build and install from source:

git clone https://github.com/CrossRef/pdfextract
cd pdfextract
gem build pdf-extract.gemspec
gem install pdf-extract-0.1.1.gem  # check version number

You should now have a program called pdf-extract available from the command line. Navigate to a directory with a PDF whose references you’d like to extract, and run the following:

pdf-extract extract-bib --resolved_references MyFile.pdf

It will take a minute to start running, and then it will begin listing the references it finds, along with their resolved DOIs from CrossRef’s web API, like so:

Found DOI from Text: 10.1080/00949659708811825 (Score: 5.590546)
Found DOI from Text: 10.1016/j.ress.2011.10.017 (Score: 4.6864557)
Found DOI from Text: 10.1016/j.ssci.2008.05.005 (Score: 0.5093678)
Found DOI from Text: 10.1201/9780203859759.ch246 (Score: 0.6951939)
Found DOI from Text: 10.1016/s0377-2217(96)00156-7 (Score: 5.2922735)
...

Note that not all resolutions are perfect. The score reflects the degree of confidence that the reference extracted from the PDF matches the indicated DOI. Scores below 1.0 will not be included in the final output, as they are probably incorrect.
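If you save the progress output to a file, you can check which DOIs clear the cutoff before the second phase starts. Here is a minimal Python sketch, assuming the log lines look exactly like the sample above; the function name split_by_score is my own, and the exact comparison PDFExtract uses at the 1.0 boundary is an assumption.

```python
import re

# Matches progress lines such as:
#   Found DOI from Text: 10.1080/00949659708811825 (Score: 5.590546)
LINE_RE = re.compile(r"Found DOI from Text:\s*(\S+)\s*\(Score:\s*([0-9.]+)\)")


def split_by_score(log_text, threshold=1.0):
    """Return (kept, dropped) lists of (doi, score) pairs.

    Assumption: a score at or above the threshold is kept. The post only
    says that scores below 1.0 are excluded from the final output.
    """
    kept, dropped = [], []
    for match in LINE_RE.finditer(log_text):
        doi = match.group(1)
        score = float(match.group(2))
        if score >= threshold:
            kept.append((doi, score))
        else:
            dropped.append((doi, score))
    return kept, dropped


if __name__ == "__main__":
    sample = (
        "Found DOI from Text: 10.1080/00949659708811825 (Score: 5.590546)\n"
        "Found DOI from Text: 10.1016/j.ssci.2008.05.005 (Score: 0.5093678)\n"
    )
    kept, dropped = split_by_score(sample)
    print("kept:", kept)
    print("dropped:", dropped)
```

The dropped list is worth a quick look: a low score sometimes means the reference text was mangled during extraction rather than that the paper has no DOI.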

Go make yourself a coffee while it searches for the rest of the DOIs. Eventually it will move to the second phase of this process, which is to use the DOI to obtain a full BibTeX entry from the web API. Again, this will not be done for DOIs with scores below 1.0.

Found BibTeX from DOI: 10.1080/00949659708811825
Found BibTeX from DOI: 10.1016/j.ress.2011.10.017
Found BibTeX from DOI: 10.1016/s0377-2217(96)00156-7
Found BibTeX from DOI: 10.1016/j.ress.2006.04.015
Found BibTeX from DOI: 10.1111/j.1539-6924.2010.01519.x
Found BibTeX from DOI: 10.1002/9780470316788.fmatter
...
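If you ever need to redo this second phase by hand for a single DOI, the public DOI resolver can return BibTeX directly through content negotiation. The Python sketch below is not PDFExtract's own code, and I am assuming the doi.org resolver with the application/x-bibtex media type; the function names are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def bibtex_request(doi):
    """Build a request asking the DOI resolver for a BibTeX record.

    Assumption: https://doi.org honours 'Accept: application/x-bibtex'
    for this DOI (content negotiation). PDFExtract may use another route.
    """
    url = "https://doi.org/" + quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def fetch_bibtex(doi, timeout=30):
    """Perform the request and return the BibTeX entry as text (needs network)."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires an internet connection.
    print(fetch_bibtex("10.1080/00949659708811825"))
```

This is handy for the odd reference that PDFExtract scored too low but that you can confirm by eye.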

Finish your coffee, check your email, and chuckle at the poor saps out there gathering their references by hand. When the program finishes, look for a file called MyFile.bib—the same filename as the original PDF—in the same directory from which you invoked the pdf-extract command. Open it up in a text editor or reference manager and take a look. Here’s the output from my example:

@article{Archer_1997,
doi = {10.1080/00949659708811825},
url = {http://dx.doi.org/10.1080/00949659708811825},
year = 1997,
month = {May},
publisher = {Informa UK Limited},
volume = {58},
number = {2},
pages = {99-120},
author = {G. E. B. Archer and A. Saltelli and I. M. Sobol},
title = {Sensitivity measures,anova-like Techniques and the use of bootstrap},
journal = {Journal of Statistical Computation and Simulation}
}
@article{Auder_2012,
doi = {10.1016/j.ress.2011.10.017},
url = {http://dx.doi.org/10.1016/j.ress.2011.10.017},
year = 2012,
month = {Nov},
publisher = {Elsevier BV},
volume = {107},
pages = {122-131},
author = {Benjamin Auder and Agn\`es De Crecy and Bertrand Iooss and Michel Marqu\`es},
title = {Screening and metamodeling of computer experiments with functional outputs. Application to thermal$\textendash$hydraulic computations},
journal = {Reliability Engineering \& System Safety}
}

... (and many more!)

A few extra-nice things: (1) it includes all DOIs, which journals sometimes require and which are pesky to track down, and (2) it attempts to escape all BibTeX special characters by default. Merge this with your existing library, and be happy! (You could even use this to recover or develop a reference library from your own papers!)
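Since every extracted entry carries a DOI, the DOI makes a convenient key for merging without duplicates. Below is a rough Python sketch, assuming each entry starts with "@" at the beginning of a line, as in the output above. It is a regex-based shortcut rather than a real BibTeX parser, and the function names are my own.

```python
import re

# Zero-width split point at every line that begins with '@'.
ENTRY_START = re.compile(r"(?m)^(?=@)")
# A doi field written as  doi = {...}  or  doi = "..."
DOI_FIELD = re.compile(r"\bdoi\s*=\s*[{\"]([^}\"]+)[}\"]", re.IGNORECASE)


def split_entries(bibtex_text):
    """Split BibTeX text into entries (assumes '@' starts each entry's line)."""
    chunks = (chunk.strip() for chunk in ENTRY_START.split(bibtex_text))
    return [chunk for chunk in chunks if chunk.startswith("@")]


def merge_by_doi(existing_text, new_text):
    """Append new entries to existing ones, skipping DOIs already present.

    DOIs are compared case-insensitively. Entries without a doi field
    are always kept, since there is nothing to compare them on.
    """
    seen = set()
    merged = []
    for entry in split_entries(existing_text) + split_entries(new_text):
        match = DOI_FIELD.search(entry)
        key = match.group(1).strip().lower() if match else None
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        merged.append(entry)
    return "\n\n".join(merged) + "\n"
```

Entries from your existing library come first, so when a DOI appears in both files, your hand-corrected version wins over the freshly extracted one.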

Caveats

  • This works a lot better on journal articles than on longer documents like theses and textbooks. It assumes that the “Reference” section is toward the end, so a chapter-based or footnote-based reference format will cause it to choke.

  • It will not work on non-digital articles—for example, older articles which were scanned and uploaded to a journal archive.

  • Be careful with character encoding when you are importing/exporting BibTeX with other applications (like Zotero), or even managing the file yourself. You may want to look for settings in all of your applications that allow you to change the character encoding to UTF-8.

  • Lots of perfectly good references do not have DOIs and thus will not be resolved by the web API. This includes many government agency reports, for example. In general do not expect to magically BibTeXify things other than journal articles and the occasional textbook.

  • Reading a PDF is tricky business, and some journal formats just won’t work. You will notice failures through (1) consistently bad DOI resolution scores, (2) a complete failure with an error message from the PDF reader (these are very hard to trace), or (3) bizarre entries at the end of your BibTeX file. I’ve accidentally “extracted” references about ornithology, for example; just delete these and move on.
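On the character encoding caveat: if you are not sure what encoding a .bib file ended up in, a small check like this one can help before you import it elsewhere. It is only a sketch in Python; the Latin-1 fallback is my assumption about the most likely alternative, and the function names are hypothetical.

```python
from pathlib import Path


def to_utf8(raw_bytes, fallback="latin-1"):
    """Decode bytes as UTF-8 if valid, otherwise with the fallback encoding.

    Returns (text, encoding_used). The fallback is a guess: Latin-1 never
    fails to decode, so wrong guesses show up as odd characters, not errors.
    """
    try:
        return raw_bytes.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw_bytes.decode(fallback), fallback


def rewrite_as_utf8(path):
    """Rewrite the file at 'path' as UTF-8 and report the encoding it had."""
    path = Path(path)
    text, used = to_utf8(path.read_bytes())
    path.write_text(text, encoding="utf-8")
    return used
```

Keep a backup copy before rewriting anything, and skim the accented author names afterwards to confirm the guess was right.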

Zotero introduction (video)

I made a short Screenr video describing how to install and use Zotero for citation management in Word. Specifically, it covers the Zotero standalone program plus the Chrome plugin. (Sorry about the breathing sounds in the microphone … I’ll work on that next time.)

http://screenr.com/aZ5s

EDIT: As far as I can tell, WordPress only allows embedding from certain video sites which do not include Screenr. So I guess you have to just open the link. Sorry about that.

Edit by Rachel: Here is a second Screenr video I created about Zotero that talks about importing/exporting citations as well as using the PDF search capability in Zotero. I recommend watching this video in full screen so you can read the text.

Web-based Free Options for Bibliography Management and LaTeX Editing

I often find myself switching between computers with different operating systems, so I try to use free tools on the web as often as I can. The purpose of this post is to make you aware of two free options that I’ve had success with.

Bibliography Management – Zotero.org

Zotero is a free bibliography management resource that works as a plug-in for Mozilla Firefox, along with plug-ins for Microsoft Office and OpenOffice. You edit your citations within Firefox, and insert them into documents using the Office plug-ins. You can import and export BibTeX into or out of Zotero, and it is compatible with the RIS format, so you can move your citations back and forth between Zotero and EndNote. When you sign up for Zotero, it will ask you to create a user account. Your web account serves as an online backup for your citations, as well as a collaborative space. You can create a profile based on your area of expertise, so you can search for users with similar research interests and share your citations with them. (Perhaps this would be a good way to create a Pat Reed Group citation database?)

If this piqued your interest, I recommend checking out the quick start guide which shows some of the cool stuff you can do with Zotero.

My only warning: make sure you’re running the latest version of Firefox, or you might have some compatibility issues with the plug-ins, especially with Word and OpenOffice. According to the website, there is a beta release for standalone Zotero as well as plug-ins for Safari and Chrome, but I haven’t used any of those options. It is also important to note that there is a 100MB limit for the free Zotero service. I have about 2,000 citations total stored online and I’m only using about 1.0MB according to the website, so I imagine that the free service will be sufficient for everyone. It is $20/year for 1GB of Zotero storage.

LaTeX – Latexlab.org

Latexlab.org is a Google Docs-based LaTeX editor. You sign in using your Google Docs account, so all your files are stored on your Google profile. Those familiar with WinEdt or other LaTeX editing software should have no trouble using the LaTeX Lab interface. You can upload images to your Google Docs account to insert them into your LaTeX document. I’d recommend using this if you’re on the go and need to put together a LaTeX document quickly.

I’ve never tried to compile anything complicated within LaTeX Lab, but if you need to put together an equation-heavy document quickly, this is a good alternative. You can combine different documents into a project, but I’ve never used that functionality. I certainly wouldn’t try to put your thesis together in LaTeX Lab, and I would shy away from it for complicated documents in general.