Writing fully reproducible papers is something everyone talks about but very few people actually do. Following nice examples I’ve seen developed by others (see here and here), I wanted to develop a GitHub template that I could easily use to organize the analysis I perform for each paper. I wanted it to be useful for the Reed group in general, but also anyone else who’d like to use it, so the version I’m presenting today is an initial version that will be adapted and evolve as our needs grow.
The template can be found here: https://github.com/antonia-had/paper_template and this blogpost will discuss its contents. The repository is set up as a template, so you can use “Import repository” when you create a new repository for your project or click on the green “Use this template” button on the top right.
The idea is that everything is organized and documented well so that another person can easily replicate your work. This will help with your own tools being more widely used and cited, but also future group members to easily pick up from where you left. The other selfish way in which this has helped me is that it forces me to spend some time and arrange things from the beginning so I can be more organized (and therefore more productive) during the project. Most importantly, when a paper does get accepted you don’t need to go back and organize everything so it looks halfway decent for a public repository. For these reasons I try to use a template like this from the early stages of a project.
A lot of the template is self explanatory, but I’ll go through to explain what is in it in short. The idea is you take it and just replace the text with your own in the README files and use it as a guide to organize your paper analysis and results.
There are directories to organize your content to code, data, and results (or anything else that works for you). Every directory has its own README listing its contents and how they should be used. All code that you didn’t write and data that you didn’t generate need to be cited. Again, this is useful to document from the beginning so you don’t need to search for it later.
Most of my work is done in Python, so I wrote up how to handle Python dependencies. The way I suggest going about it is through a ‘.yml‘ file that specifies all the dependencies (i.e. all the packages and versions your script uses) for your project. I believe the best way to handle this is by creating a Python environment for every project you work on so you can create a separate list of dependencies for each. We have a nice blogpost on how to create and manage Python environments here.
When the project is done and you’re ready to submit or publish your paper, export all dependencies by running:
conda env export > environment.yml --no-builds
and store your environment.yml in the GitHub repository. When someone else needs to replicate your results, they would just need to create the same Python environment (running conda env create --file environment.yml)
before executing your scripts.
Finally, you could automate the analysis and figure production with a makefile that executes your scripts so who ever is replicating does not need to manually execute all of them. This also helps avoiding somebody executing the scripts in the wrong order. An example of this can be found in this template. The makefile can also be in Python, like Julie did here.
In recognition that this is not an exhaustive template of everything one might need, the aim is to have this blog post and the template itself evolve as the group identifies needs and material to be added.