24 Jan 2018

# Files and Processes

In the last few months, I have created three notable programming projects: RunForrest saves function call graphs to disk and runs them in parallel processes; TimeUp creates backups using rsync and keeps different numbers of hourly, daily, and weekly backups; and JBOF, which organizes large collections of data and metadata as structured, on-disk datasets.

These projects have one thing in common: they use Python to interact with external things, such as files, libraries, and processes. It surprised me that none of these projects were particularly hard to build, even though they accomplish "hard" tasks. This prompted some soul-searching about why I had considered these tasks hard, and I have come up with two observations:

1. Files and processes are considered "not part of the language", and are therefore not taught. Most programming classes and tutorials I have seen focus on the internals of a programming language, i.e. its data structures, built-in functions, and libraries. Files and processes are not part of this, and are often only mentioned in passing, as a thing that a particular library can do. Worse, many curricula never formally explain files or processes, or their use in building programs.

I now believe that this is unfortunate and misguided, since interaction with the computer's resources is the central benefit of programming, and you cannot make much use of those resources without a thorough understanding of files and processes. In a way, the "inside" of a programming language is a mere sandbox, a safe place for toying with imaginary castles. But it is the "outside", that is, external programs, libraries, and files, that unlocks the true power of making the computer do work. And in Python in particular, using these building blocks to build useful programs is surprisingly simple.
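
To illustrate how little code this takes, here is a minimal sketch (the child process and its message are invented for the example) that touches both worlds using only the standard library: the file system via pathlib, and an external process via subprocess:

```python
import subprocess
import sys
from pathlib import Path

# "Outside" resource no. 1: the file system, via pathlib.
here = Path(".")
python_files = sorted(p.name for p in here.glob("*.py"))

# "Outside" resource no. 2: an external process, via subprocess.
# We run a second Python interpreter and capture its output as text:
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```

That is the whole trick: a generator over directory entries, and one function call that spawns, waits for, and reads from a child process.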

2. However, such small and simple programs are much less popular than bloated behemoths. I built RunForrest explicitly because Dask was too confusing and unpredictable for the job. I built JBOF because h5py was too complex and slow. This is surprising, since those tools are certainly vastly more mature than anything I can whip up. But they were developed by large organizations to solve large-organization problems, and I am not a large organization; my needs are different as well.

I now believe that such small-scale solutions are often preferable to high-profile tools, but they lack the visibility and publicity of tools such as Dask and HDF. Even worse, seeing these big tools solve such mundane problems, we come to believe that these problems must be incredibly complex, and beyond our abilities. Thus a vicious cycle forms. Of course, this is not to say that big tools serve no purpose. Dask and HDF were built to solve particular problems; we should just be aware that most big tools were built for big problems, and our own problems are often not big enough to warrant their use.

In summary, we should teach people about files and processes, and empower them to tackle "hard" tasks without resorting to monolithic libraries. Not only is this incredibly satisfying, it also leads to better programs and better programmers.

29 Dec 2017

# How to set up rsnapshot instead of Time Machine

(This blog post was changed since my initial strategy of disabling the lockfile didn't work. Turns out, the lockfile is required, and backups have to be stacked.)

Yesterday, I wrote about how Time Machine has failed me. Time Machine keeps regular backups, going back as far as your hard drive space permits. In theory. In practice, every year or so it messes up somehow and has to start over, thereby deleting all your older backups. A backup that is not reliable is not a backup.

Luckily, there are alternatives. Probably the easiest is rsync [1], a very cool tool that copies files and directories from one place to another. You could simply run it once a day, and have a new backup every day. You can even configure rsync so it doesn't copy unchanged files, and instead hard-links them from an older backup. rsnapshot automates this process to keep a number of tiered copies, for example ten hourly backups, seven daily backups, four weekly backups, and a hundred monthly backups. Each backup is then simply a directory that contains your files. No fancy starfield GUI, but utterly reliable and trivial to understand [2].
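
The hard-link trick is easy to demonstrate with Python's standard library; this sketch (file and directory names invented for the example) shows that a hard-linked file is just a second name for the same data on disk, which is why an unchanged file in ten backups costs the space of one:

```python
import os
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
original = tmp / "backup.0" / "notes.txt"
original.parent.mkdir()
original.write_text("unchanged since last backup")

# Instead of copying, link the unchanged file into the new backup:
linked = tmp / "backup.1" / "notes.txt"
linked.parent.mkdir()
os.link(original, linked)

# Both names point at the very same data on disk (same inode):
assert os.stat(original).st_ino == os.stat(linked).st_ino
print(linked.read_text())
```

rsync does exactly this (via its --link-dest option) for every file that hasn't changed since the previous backup.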

Setting up rsnapshot on macOS is not quite as straightforward as I'd like, and I couldn't find a great guide online. So, without further ado, here's how to configure rsnapshot on macOS:

• Install rsnapshot
brew install rsnapshot

• Write the config file

You can copy a template from homebrew:

cp /usr/local/Cellar/rsnapshot/1.4.2/etc/rsnapshot.conf.default /usr/local/etc/rsnapshot.conf


And then configure the new configuration file to your liking (preserve the tabs!):

config_version	1.2 # default
verbose		2   # default
loglevel	3   # default

# this is where your backups are stored:
snapshot_root	/Volumes/BBackup/Backups.rsnapshot/ # make sure this is writeable
# prevent accidental backup corruption:
lockfile	/Users/bb/.rsnapshot.pid
# use this if you back up to an external drive:
no_create_root	1   # don't back up if the external drive is not connected

# configure how many tiers of backups are created:
retain	hourly	10
retain	daily	7   # dailies will only be created once 10 hourlies exist
retain	weekly	4   # weeklies will only be created once 7 dailies exist
retain	monthly	100 # monthlies will only be created once 4 weeklies exist

# the list of directories you want to back up:
backup	/Users/bb/Documents		localhost/
backup	/Users/bb/eBooks		localhost/
backup	/Users/bb/Movies		localhost/
backup	/Users/bb/Music		localhost/
backup	/Users/bb/Pictures		localhost/
backup	/Users/bb/Projects		localhost/
backup	/Users/bb/Projects-Archive		localhost/


Instead of localhost, you can use remote machines as well. Check man rsync for details.

• Make sure it works and create initial backup
rsnapshot -c /usr/local/etc/rsnapshot.conf hourly


The first backup will take a while, but subsequent backups will be fast. A normal backup on my machine takes about two minutes and runs unnoticeably in the background.

• Write launchd Agent

Next, we have to tell macOS to run the backups in regular intervals. Conceptually, you do this by writing a launchd agent script [3], which tells launchd when and how to run your backups. In my case, I create four files in /Users/bb/Library/LaunchAgents/, called rsnapshot.{hourly,daily,weekly,monthly}.plist. Apple's documentation for these files is only mildly useful (as usual), but man launchd.plist and man plist should give you an idea how this works.

Here is my hourly launchd agent (I'll explain the bash/sleep thing later):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>rsnapshot.hourly</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>-c</string>
<string>sleep 0 && /usr/local/bin/rsnapshot -c /usr/local/etc/rsnapshot.conf hourly</string>
</array>
<key>StartCalendarInterval</key>
<dict>
<key>Minute</key>
<integer>0</integer>
</dict>
</dict>
</plist>


For the other three scripts, change the two occurrences of hourly to daily, weekly, or monthly respectively, and change the <dict> portion at the end to

• daily:
<key>Minute</key>
<integer>0</integer>
<key>Hour</key>
<integer>0</integer>

• weekly:
<key>Minute</key>
<integer>0</integer>
<key>Hour</key>
<integer>0</integer>
<key>Weekday</key>
<integer>1</integer>

• monthly:
<key>Minute</key>
<integer>0</integer>
<key>Hour</key>
<integer>0</integer>
<key>Day</key>
<integer>1</integer>


However, rsnapshot can only ever run one backup at a time without stepping on its own toes. This is a problem when the computer wakes up, and more than one backup was scheduled during its sleep, since launchd will then happily launch all missed backups at the same time. But only one of them will succeed.

To fix this, I stagger the backup tiers using the sleep directive: sleep 0 for hourly, sleep 900 (15 minutes) for daily, sleep 1800 (30 minutes) for weekly, and sleep 2700 (45 minutes) for monthly [4]. It seems that there should be a more elegant solution than this, but I haven't found one.

From the documentation, you might think that <key>Program</key> would be more succinct than supplying the binary as the first argument of <key>ProgramArguments</key>, but this apparently uses a different syntax and does not in fact work as expected.

• Load the launchd agents
launchctl load ~/Library/LaunchAgents/rsnapshot.*

• Test launchd agent
launchctl start rsnapshot.hourly


If it doesn't work, Console.app might show a relevant error message.

• Remove backup directory from Spotlight

Go to System Preferences → Spotlight → Privacy → Add your snapshot_root directory from earlier

• Disable Time Machine and delete your existing backup (if you want)

Start Time Machine, right-click any directory you want to delete, and select "delete all backups of $dir".

## Caveats

The configuration file of rsnapshot says that you might experience data corruption if you run several copies of rsnapshot at the same time (and you can use the lockfile to prevent this). This is a problem if your computer is asleep while rsnapshot is scheduled to run, since launchd will then re-schedule all missed tasks at once when the computer wakes up. If you enable the lockfile, only one of them will run.

On the other hand, only the hourly task will actually create a new backup. All higher-level backup tiers merely copy existing backups around, so in theory, they shouldn't step on each other's toes when run concurrently. I have opened an issue asking about this.

There are other possible solutions: ① You could modify the launchd entry such that backups only trigger after a few minutes or, better yet, only once all other instances of rsnapshot have finished. I am not sure if launchd supports this, though. ② You could schedule the hourly task using cron instead of launchd, since cron will not reschedule missed tasks. This would only work for two tiers of backups, though. ③ You could just ignore the issue and hope for the best. After all, if a daily or hourly backup gets corrupted every now and then, you still have enough working backups…
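
The lockfile mechanism itself is not magic. Here is a rough Python sketch of the general PID-lockfile pattern (an illustration, not rsnapshot's actual implementation): record your own process ID in the file, and refuse to run if the recorded process is still alive.

```python
import os
import tempfile
from pathlib import Path

# A fresh directory, so no stale lockfile can interfere with the demo:
LOCKFILE = Path(tempfile.mkdtemp()) / "rsnapshot.pid"

def acquire_lock() -> bool:
    """Return True if we got the lock, False if another instance runs."""
    if LOCKFILE.exists():
        pid = int(LOCKFILE.read_text())
        try:
            os.kill(pid, 0)   # signal 0 checks existence without killing
        except ProcessLookupError:
            pass              # stale lockfile from a dead process
        else:
            return False      # another instance is still running
    LOCKFILE.write_text(str(os.getpid()))
    return True

print(acquire_lock())  # True: nobody holds the lock, so we take it
print(acquire_lock())  # False: the lock (our own) is already held
```

This is why stacked launchd jobs don't corrupt anything: the second and third rsnapshot invocations see a live PID in the lockfile and simply exit.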

## Footnotes:

[1] rsync is one of those reliable tools I talked about. It is rock solid, incredibly versatile, and unapologetically single-minded. A true gem!

[2] This works great for local backups. If you need encrypted or compressed backups (maybe on an untrusted remote machine), this post recommends Borg instead of rsnapshot, but you will lose the simplicity of plain directories.

[3] I use launchd instead of cron since launchd will re-schedule missed backups if the computer was asleep.

[4] This will fail if the hourly backup takes longer than 15 minutes. That is rather unlikely, though, or at least should not happen often enough to be of concern.

28 Dec 2017

# Dropbox deleted my pictures and Time Machine didn't backup

Dropbox deleted some of my favorite photos. Have you looked at all your old pictures lately and checked if they are still there? I have, and they were not. Of course Dropbox denies it is their fault, but no other program routinely accessed my pictures. I am not alone with this problem. It must have happened some time between the summer of 2015, when I put my pictures on Dropbox, and the summer of 2016, when Time Machine last corrupted its backups and had to start over, thereby deleting my last chance of recovering my pictures. The pictures are gone for good.

So, what have I learned? Dropbox loses your data, and Time Machine can't restore it. These programs are obviously no good for backups. Let me repeat this: Dropbox and Time Machine are not a backup! A true backup needs to be reliable, keep an infinite history, and never, never, never accidentally delete files.

From now on, I will use rsnapshot for backups. Here's a tutorial on how to set it up on a Mac. I have used rsnapshot for years at work, and it has never let me down. For synchronizing things between computers, I now use Syncthing. Neither of these programs is as user-friendly as Dropbox or Time Machine, but that is a small price to pay for a working backup.

A few years ago, I had high hopes that Apple and Dropbox and Google and Amazon would lead us to a bright future of computers that "just work", and could solve our daily chores ever more conveniently and reliably. But I was proven wrong. So. Many. Times. It seems that for-profit software inevitably becomes less dependable as it adds ever more features to attract ever more users. In contrast, free software can focus on incremental improvements and steadily increasing reliability.

20 Dec 2017

# Books of 2017

In late 2016, I took a short ferry flight to a small island in the area, and rekindled my love for aviation. Shortly afterwards, I started training for a pilot's license, and reading about aviation. From a literary perspective, aviation exists in the perfect goldilocks time frame of being just old enough to be thoroughly romanticized, but young enough for first-hand reports and thorough documentation to be available. What is more, powered flight has provided human observers with an unprecedented view of our world and our struggles, and is often as philosophical as it is exhilarating.

Out of a long list of fascinating books on aviation I have read over the last two years, my favorites are:

• Fate is the Hunter by Ernest K. Gann, a gripping memoir of the early days of aviation. It has been only a little more than a hundred years since humans first took to the skies at Kitty Hawk in 1903, yet today aviation feels as mundane as horse-drawn carriages must have felt to the Wright Brothers. Gann lived through those early days, and tells his tales from a time when aviation was still young, dangerous, and perhaps more interesting. If you want to read more like this, Flight of Passage by Rinker Buck and The Spirit of St. Louis by Charles Lindbergh are easy recommendations as well.
• Carrying the Fire by Michael Collins, one of the few first-hand accounts of an Apollo astronaut's voyage to the Moon. Most astronauts have published books later on, but many had them ghost-written, and none are as visceral and engaging as Michael Collins' journey on Apollo 11. It is humbling that Apollo's achievements have not been surpassed, despite our much more advanced technology and science. Other accounts worth reading are The Last Man on the Moon by Gene Cernan on Apollo 17, and How Apollo Flew to the Moon by W. David Woods for a more technical view.
• Skyfaring by Mark Vanhoenacker is a more modern, and more philosophical, account of how aviation has changed our perception of the world. If you yearn to fly like I do, this book is a balm for the soul. The almost spiritual feeling of cutting your bonds with the ground is what this book is about, despite being written in today's unromantic days of routine commercial airliners. I love it dearly. A more grounded account of aviation's history is Turbulent Skies by T. A. Heppenheimer, and maybe Slide Rule by Nevil Shute for some history on airships.

But as much as I love aviation, my first love is still Science Fiction. We live in strange times of unprecedented prosperity, and yet we are strangely unsatisfied, as if the future didn't turn out to be the utopia it was meant to be. Or is this just a reflection of ourselves, how we do not live up to the future we built?

• A Closed and Common Orbit by Becky Chambers describes a more distant future, when space travel is as mundane as airliners are today, and humanity is just one of several alien species. Yet, with all its technological marvels, we still yearn for meaning and love, regardless of what strange world we live in. A Closed and Common Orbit is the second book in the series, and the first book I read. I think I prefer it in this order.

• The Laundry Files series by Charles Stross is closer to home. Did you ever notice how computer programming is eerily similar to the arcane incantations we use to describe magic in fiction? Where is the difference between invoking a function that affects a robot, and invoking a magic spell that affects a demon? According to Charles Stross, this difference is really only semantics, and we should be very careful with our incantations, lest the ghost in the machine really does have our demise in mind. These books have made me laugh out loud so many times, like when a weaponized PowerPoint turned people into zombies, or when a structured cabling project turned out to create an inadvertent summoning grid for an elder horror.

20 Nov 2017

# PyEnv is the new Conda

How to install Python? If your platform has a package manager, you might be tempted to use that to install Python. But I don't like that: the packaged version is often outdated, and you risk messing with an integral part of your operating system. Instead, I like to install a separate Python in my home directory. I used to use Anaconda (or WinPython, or EPD) to do this. But now there is a better way: PyEnv.

The thing is: PyEnv installs (any version of) Python. That's all it does.

So why would I choose PyEnv over the more popular Anaconda? Because Anaconda is a Python distribution, a package manager, an environment manager, and a platform for paid packages. They once broke pip because they wanted to promote conda instead. Some features of conda require a login, some require a paid subscription. When you install packages through conda, you get binaries and source code from Anaconda's servers, not the official packages from PyPI, and those might or might not be up-to-date and feature-complete. For every package you install, you have to choose between pip and conda, and the same goes for specifying your dependencies.

As an aside, many of these complaints are just as true for package-manager-provided Python packages (which often break pip, too!). Just like Anaconda, package managers want to be the true and only source of packages, and don't like to interact with Python's own package manager.

In contrast, with PyEnv, you install a Python. This can be a version of CPython, PyPy, IronPython, Jython, Pyston, Stackless, Miniconda, or even Anaconda. It downloads the sources from the official repositories and compiles them on your machine [1]. Plus, it provides an easy and transparent way of switching between installed versions (including any system-installed versions). After that, you use Python's own venv and pip.
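
To show how lightweight the official tools are, here is the standard library's venv module creating an isolated environment programmatically (the directory name is made up; from the shell you would just run python -m venv instead):

```python
import sys
import tempfile
import venv
from pathlib import Path

# Create a throwaway virtual environment (without pip, to keep it fast):
env_dir = Path(tempfile.mkdtemp()) / "demo-env"
venv.create(env_dir, with_pip=False)

# The environment has its own python executable and a pyvenv.cfg
# recording which interpreter it was created from:
exe = env_dir / ("Scripts/python.exe" if sys.platform == "win32" else "bin/python")
print(exe.exists(), (env_dir / "pyvenv.cfg").exists())
```

Everything here ships with Python itself; no third-party environment manager is involved.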

I find this process much simpler, and easier to manage, because it relies on small, orthogonal tools (pyenv, venv, pip) instead of one integrated conda that kind of does everything. I also like to use these official tools and packages instead of conda's parallel universe of mostly-open, mostly-free, mostly-standard replacements.

Mind you, conda solved real problems back in the day (binary package distribution, Python version management, and environment management), and arguably still does (MKL et al, paid packages). But since wheels became ubiquitous and painless, virtualenv was integrated into Python, and PyEnv was developed, these issues now have better solutions, and conda is no longer needed for my applications.

## Footnotes:

[1] The downside of compilation: no Windows support.