27 May 2020

How to Write a Dissertation

Assembling scientific documents is a complex task. My documents are a combination of graphs, data, and text, written in LaTeX. This post is about combining these elements, keeping them up to date, while not losing your mind. My techniques work on any Unix system on Linux, macOS, or the WSL.

Text

For engineering or science work, my deliverables are PDFs, typically rendered from LaTeX. But writing LaTeX is not the most pleasant of writing environments. So I've tried my hand at org-mode and Markdown, compiled them to LaTeX, and then to PDF. In general, this worked well, but there always came a point where the abstraction broke, and the LaTeX leaked up the stack into my document. At which point I'd essentially write LaTeX anyway, just with a different syntax. After a few years of this, I decided to cut the middle-man, bite the bullet, and just write LaTeX.

That said, modern LaTeX is not so bad any more: XeLaTeX supports normal OpenType fonts, mixed languages, proper unicode, and natively renders to PDF. It also renders pretty quickly. My entire dissertation renders in less than three seconds, which is plenty fast enough for me.

To render, I run a simple makefile in an infinite loop that recompiles my PDF whenever the TeX source changes, giving live feedback while writing:

diss.pdf: diss.tex makefile $(graph_pdfs)
	xelatex -interaction nonstopmode diss.tex

We'll get back to $(graph_pdfs) in a second.

Graphs

A major challenge in writing a technical document is keeping all the source data in sync with the document. To make sure that all graphs are up to date, I plug them into the same makefile as above, but with a twist: All my graphs are created from Python scripts of the same name in the graphs directory.

But you don't want to simply execute all the scripts in graphs, as some of them might be shared dependencies that do not produce PDFs. So instead, I only execute scripts that start with a chapter number, which conveniently sorts them by chapter in the file manager, as well.

Thus all graphs render into the main PDF and update automatically, just like the main document:

graph_sources = $(shell find graphs -regex "graphs/[0-9]-.*\.py")
graph_pdfs = $(patsubst %.py,%.pdf,$(graph_sources))

graphs/%.pdf: graphs/%.py
	cd graphs; .venv/bin/python $(notdir $<)

The first two lines build a list of all graph scripts in the graphs directory, and their matching PDFs. The last two lines are a makefile recipy that compiles any graph script into a PDF, using the virtualenv in graphs/.venv/. How elegant these makefiles are, with recipe definitions independent of targets.

This system is surprisingly flexible, and absolutely trivial to debug. For example, I sometimes use those graph scripts as glorified shell scripts, for converting an SVG to PDF with Inkscape or some similar task. Or I compile some intermediate data before actually building the graph, and cache them for later use. Just make sure to set an appropriate exit code in the graph script, to signal to the makefile whether the graph was successfully created. An additional makefile target graphs: $(graph_pdfs) can also come in handy if you want ignore the TeX side of things for a bit.

Data

All of the graph scripts and TeX are of course backed by a Git repository. But my dissertation also contains a number of databases that are far too big for Git. Instead, I rely on git-annex to synchronize data across machines from a simple webdav host.

To set up a new writing environment from scratch, all I need is the following series of commands:

git clone git://mygitserver/dissertation.git dissertation
cd dissertation
git annex init
env WEBDAV_USERNAME=xxx WEBDAV_PASSWORD=yyy git annex enableremote mywebdavserver
git annex copy --from mywebdavserver
(cd graphs; pipenv install)
make all

This will download my graphs and text from mygitserver, download my databases from mywebdavserver, build my Python environment with pipenv, recreate all the graph PDFs, and compile the TeX. A process that can take a few hours, but is completely automated and reliable.

And that is truly the key part; The last thing you want to do while writing is being distracted by technical issues such as "where did I put that database again?", "didn't that graph show something different the other day?", or "I forgot to my database file at work and now I'm stuck at home during the pandemic and can't progress". Not that any of those would have ever happened to me, of course.