Basti's Scratchpad on the Internet
14 May 2017

Scientific Code

A short while ago, I spent a few weeks collecting and evaluating various implementations of speech analysis algorithms for my current work. TL;DR: the general quality of scientific software is bad, and needs to improve.

Let me state up front that I am explicitly not talking about the quality of the science itself. This blog post is exclusively focused on the published software. In general, the problems were bad programming, old and unmaintained code, and lack of documentation. While none of these things mean bad science, they are a real challenge to reproducibility. Follow-up work cannot cite algorithms it cannot run.

Most scientists are not programmers, and use programming as a tool for doing science. Scientists typically don't conceptualize algorithms as code, and don't reason about their implementation in terms of code. Consequently, executable implementations tend to not read like code, either. And frankly, this is as it should be. In my area of research, programming abstractions are not a great medium for expressing scientific ideas, and programming transformations are weak in comparison to mathematical operations.

Still, the result is bad code. And bad code is a problem for reproducibility and comparability. Scientists therefore need to produce better code that is readable and portable. Straight-up, letter-by-letter translations of the published math are just not enough, with all their single-letter variable names and complex nested expressions. We have to do better.

Include an Example Script

I have seen many instances of code that clearly worked at one point on the original author's machine, but doesn't on my computer. Maybe that is because I am running a different version of the programming environment, maybe my data is subtly different, maybe the author forgot to document a dependency. All of these happened to me, and all of these are exactly what you would expect from old, unmaintained code from non-expert programmers.

The least we can do to improve this situation is to document how the code was supposed to work. A simple example script with example data and example results lets me verify that the code does what it is supposed to be doing. Without this, I cannot diagnose errors. Worse, if no example results are included, I might conclude that the algorithm was bad, when in reality I was simply using an incorrect version of a dependency.
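
To illustrate, here is a minimal sketch of what such a script might look like; the file and function names are hypothetical stand-ins for the published algorithm:

    % run_example.m -- hypothetical demo script shipped alongside the algorithm
    [signal, fs] = audioread('example_data.wav');   % small example recording included with the code
    f0 = estimate_pitch(signal, fs);                % the published algorithm (hypothetical name)
    fprintf('estimated pitch: %.1f Hz\n', mean(f0));
    % the accompanying README states the reference value this should print,
    % so anyone can check that their setup reproduces it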

Document Dependencies

It is important to document all toolboxes, libraries, and programming languages used, including their exact version and operating system. At one point, I spent several hours debugging some Java code that worked in Java 5, but didn't on any recent Java version. If this had been documented, I could have fixed that error much more quickly. In another case, one piece of C code contained a subtle error if compiled with a 64 bit compiler. Again, this took a long time to track down, and would have been much easier if the original operating system and compiler version had been documented.

That same piece of C code could only be run by compiling several library dependencies from old versions of their source code. In that case, the author helpfully included copies of the original source code for these libraries with his source distribution. Without that, I would never have gotten that piece of code to run!

Also, published code should use as few dependencies as possible. Not only might I not have access to that fancy Matlab Toolbox, but every step in the installation instructions is another thing that can go wrong. The more self-contained the code, the more likely it is that it will still be executable many years later.

Higher-Level Languages are Better than Lower-Level Languages

In general, I had far fewer problems with non-compiled, high-level code for Matlab, Python, or R than with low-level C or Java code. On the one hand, this is a technological problem: it is easier to keep an interpreter compatible with outdated code than it is to keep a more complex tool chain of compilers, linkers, and runtime libraries compatible. On the other hand, this is a human problem as well: I happen to know Python, Matlab, and C pretty well, but I don't know much R or Java. Still, it is much easier to reason about the runtime behavior of R code (because I can inspect variables like in any other dynamic programming language) than to debug some unknown build tool interaction in Java (because compilation tool chains are typically much more complex and variable than REPLs).

Keep it Simple

A particular problem is compiled modules for an interpreted language. Such Mex files and CPython extensions are almost guaranteed to break when programming language versions change, and are often hard to upgrade, since the C APIs tend to change with the language version as well. Often, these compiled modules are provided for performance reasons. But what was painfully slow on the original author's work laptop a few years ago might not be a problem at all on a future compute cluster. If the code absolutely has to be provided as a Mex file or a CPython extension, we should at least provide an interpreted version as well.
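
One way to do that, sketched here with hypothetical function names, is a thin wrapper that prefers the Mex file but falls back to a pure-Matlab implementation when the compiled module is not available:

    function y = fast_filter(x)
    % hypothetical wrapper around a compiled module, with an interpreted fallback
        if exist('fast_filter_mex', 'file') == 3   % exist returns 3 for a Mex file on the path
            y = fast_filter_mex(x);                % fast, but tied to one Matlab and compiler version
        else
            y = fast_filter_slow(x);               % plain Matlab reference implementation
        end
    end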

And speaking of compute clusters, I have had a lot of trouble with clever tricks such as massively parallelized code or GPU-optimized code. Not only does this age poorly, it also wreaks havoc when trying to run this code on a compute cluster. Such low-level performance optimizations are rarely a good idea for published scientific code, and should at least be optional and well-documented.

Document the Code

Code is never self-documenting. Given enough time, entropy increases inexorably: conventions change, and what seemed obvious a few years ago might look like gibberish today. What is perfectly readable to a domain expert can be inscrutable even to scientists in closely related areas of research. It is therefore necessary to always document published code, even if the code looks perfectly obvious to the author. In the same vein, we should refrain from using jargon in our code, and clearly declare our variables and invariants (they are bound to be different for different people).

Importantly, this includes documenting all file formats and data structures. Institutions often have in-house standards for how to represent certain kinds of data, and practitioners don't realize that these conventions are not followed universally. I had to ignore a few apparently high-quality algorithms simply because I could not figure out how to supply data to their code. For many other algorithms, I had to write custom data exporters and importers. Again, an algorithm won't get cited if it can't be run.
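
A few lines of comments in a function header go a long way here. A hypothetical example of what documenting an input structure could look like:

    function features = analyze_recording(rec)
    % ANALYZE_RECORDING  hypothetical example of a documented data structure.
    %   rec.signal      column vector of audio samples as doubles in [-1, 1]
    %   rec.samplerate  scalar sample rate in Hz
    %   rec.speaker     char array with an anonymized speaker ID
    % Returns a struct with one scalar field per feature.
        features = struct('rms',      sqrt(mean(rec.signal .^ 2)), ...
                          'duration', numel(rec.signal) / rec.samplerate);
    end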

Make it Automatable

Another thing I see frequently is very fancy graphical user interfaces for scientific code. The only thing that breaks faster than compiled language extensions is GUI programs. GUIs are necessarily complex, and therefore hard to debug. Worse yet, if the code can only be run from a GUI, it can't be automated, and I won't be able to compare its performance in a big experiment with hundreds of runs. In effect, GUIs make an algorithm non-reproducible. If a GUI needs to be included in the code, it should at least be made optional.
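
A sketch of what "optional" can mean in practice, again with hypothetical names: keep the algorithm in a plain function that takes all parameters as arguments, and let the GUI be a thin layer that merely collects those parameters and calls the same function.

    function events = detect_events(signal, fs, threshold)
    % hypothetical scriptable entry point: no GUI, all parameters are arguments,
    % so it can be called from a batch script on a compute cluster
        events = find(abs(signal) > threshold) / fs;   % placeholder for the real algorithm
    end
    % a GUI, if provided at all, lives in a separate file, gathers
    % signal, fs, and threshold interactively, and then calls detect_events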

Do Release Code

More important than anything I have said so far: do release source code for your algorithm. Nothing is more wasteful than reading about an amazing algorithm that I can't try out. If no source code is released, the work is not reproducible, it can't be compared to other algorithms in the field, and it might as well not have been published.

12 Apr 2017

Matlab Metaprogramming

Why is it that I find Matlab to be a fine teaching tool and a fine tool for solving engineering problems, but at the same time extremely cumbersome for my own work? Recently, the answer struck me: metaprogramming in Matlab sucks.

Matlab is marketed as a tool for engineers to solve engineering problems. There are convenient data structures for numerical data (arrays, tables), less convenient data structures for non-numeric data (cells, structs, chars), and a host of expensive but powerful functions and methods for working with this kind of data. This is the happy path.

But don't stray too far from the happy path; horrors lurk where The Mathworks don't dare go. Basic stuff like talking to sockets or interacting with other programs is very cumbersome in Matlab, and sometimes downright impossible for lack of threads, pipes, and similar infrastructure. But this is common knowledge, and consistent with Matlab's goals as an engineering tool, not a general purpose programming language. These are first-order problems, and they are rarely insurmountable.

The more insidious problem is metaprogramming, i.e. when the objects of your code are code objects themselves. The first order use of programming is to solve real-world problems. If these problems are numeric in nature, Matlab has got you covered. But as every programmer discovers at some point, the second order use of programming is to solve programming problems. And by golly, will Matlab let you down when you try that!

As soon as you climb that ladder of abstraction, and the objects of your code become code objects themselves, you will enter weird country. You think exist will tell you whether a variable name is taken? Try calling it on a method. You think nargout will always give you a number? Again, methods will enlighten you. Quick, how do you capture all output arguments of a function call in a variable? x = cell(nargout(fun), 1); [x{:}] = fun(...), obviously (this sometimes fails). And don't even think of trying to overload subsref to create something generic. Those subsref semantics are crazy talk!
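
For reference, here is that output-capturing idiom spelled out, assuming a hypothetical function file fun.m with a fixed number of declared outputs and some hypothetical input data:

    n = nargout(@fun);       % number of declared outputs; negative if fun uses varargout
    out = cell(1, n);        % one cell per output argument
    [out{:}] = fun(data);    % capture every output of the call into the cell array
    % if fun uses varargout, n comes back negative and the idiom breaks --
    % one of the ways it fails in practice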

I could go on. The real power of code is abstraction. We use programming to repeatedly and reliably solve similar problems. The logical next step is to use those same programming tools to solve whole classes of similar problems. This happens to all of us, and Matlab makes it extremely hard to deal with. Thus, it puts a ceiling on the level of abstraction that is reasonably achievable, and limits engineers to first-order solutions. And after a few years of acclimatization, it will put that same ceiling on those engineers' thinking, because you can't reason about what you can't express.

24 Nov 2016

Pip is great

Installing Python packages used to be a pain. Very surprisingly, this is no longer the case. Nowadays, pip install whatever will reliably install pretty much anything without any trouble.

To recap, the problem is that many Python packages rely on C code, which needs to be compiled before installation. In the past, this burden was mostly on the user. Depending on the user's knowledge of C and compilers, and the user's operating system, this could become almost arbitrarily hairy.

This problem was solved, to some extent, by pre-compiled packages: first as binary installers on the package websites, then as pre-packaged Python distributions, and later through third-party package managers such as conda.

This worked well, but it fractured the ecosystem into several different mostly-compatible package sources. This was no big problem, but users had to decide on a package-by-package basis whether to download an installer, use conda, or use pip.

But I am happy to announce that these dark days are behind us now. Pip now automatically installs binary Python packages called wheels, and most package developers now provide wheels on PyPI. This is so obviously superior to the past that many high-profile packages don't even provide binary installers any more.

For me, this means I don't have to rely on conda any longer. I don't have to keep a mental list of which packages to install through conda and which packages to install through pip any longer. I can go back to using virtualenv instead of conda-envs1 again. I don't have to tell students to download pre-packaged Python distributions any longer. Pip is great!

Footnotes:

1. Not because I dislike conda, just to simplify my development environment.

22 Nov 2016

Soma

For a long time, I was afraid of picking up Soma, since it came from the developers who did Amnesia. Amnesia was the first game that scared me so thoroughly that I just couldn't bring myself to pick it up again after the first session. I admired it for its amazing world design, interactivity, and story, but it was too scary for me. I don't enjoy being scared.

I was thus of two minds when I heard about Soma: it was supposed to be less scary, and SciFi. But still, let me get this out of the way: this game scared me witless. I had to use an online game guide to warn me of monsters and tell me how to evade them. I was too scared to figure this stuff out by myself, all alone in this creepy, nasty, underwater space station.

But at the same time, this game delivered one of the best stories I have ever seen in a video game. The story left me shaken and thoughtful for days, and it was what dragged me back into the game even though I was really struggling with the monster sections. Even the delivery, through conversations, environmental clues, and audio logs, was truly outstanding. I just love grimy Alien-style SciFi, and this is one of the very few video game stories that could fill more than a paragraph or two in a well-written book. I love it.

Apparently, if you like horror games, this is a very mild one. Boring even, if you can believe my favorite online critics. But believe me, it was right on the edge of what I was able to take. I still can't quite decide whether I would have preferred this game without the monsters, or whether the monsters were a necessary tool for evoking this intense sense of dread on the lonely SciFi ocean floor. Maybe the monsters lent some much-needed life and interactivity to the walk-em-up formula.

At any rate, I thoroughly enjoyed this experience, and can't wait for what this developer will create next. I wholeheartedly recommend this, as one of the best SciFi stories of this year, in any medium. ★★★★★

07 Sep 2016

MATLAB Syntax

In a recent project, I tried to parse MATLAB code. During this trying exercise, I stumbled upon a few… unique design decisions of the MATLAB language:

Use of apostrophes (')

Apostrophes can mean one of two things: If applied as a unary postfix operator, it means transpose. If used as a unary prefix operator, it marks the start of a string. While not a big problem for human readers, this makes code surprisingly hard to parse. The interesting bit about this, though, is the fact that there would have been a much easier way to do this: Why not use double quotation marks for strings, and apostrophes for transpose? The double quotation mark is never used in MATLAB, so this would have been a very easy choice.
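
A small example of the two readings side by side:

    A = [1 2; 3 4];
    B = A';        % postfix apostrophe: transpose of A
    s = 'hello';   % the very same character opens a string literal
    C = A'';       % two postfix apostrophes: A transposed twice, not an empty string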

Use of parens (())

If parens follow a variable name, they can mean one of two things: If the variable is a function, the parens denote a function call. If the variable is anything else, this is an indexing operation. This can actually be very confusing to readers, since it makes it entirely unclear what kind of operation foo(5) will execute without knowledge about foo (which might not be available until runtime). Again, this could have been easily solved by using brackets ([]) for indexing, and parens (()) for function calls.
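
The same expression, two meanings:

    x = [10 20 30];
    y = x(2);     % x is an array, so this is indexing: y is 20
    z = sin(2);   % sin is a function, so this is a call
    % written down in isolation, foo(5) could be either -- which one it is
    % depends entirely on what foo happens to be at runtime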

Use of braces ({}) and cell arrays

Cell arrays are multi-dimensional, ordered, heterogeneous collections of things. But in contrast to every other collection (structs, objects, maps, tables, matrices), they are not indexed using parens, but braces. Why? I don't know. In fact, you can index cell arrays using parens, but this only yields a new cell array that wraps the selected values, rather than the values themselves. Why would this ever be useful? I have no explanation. This constantly leads to errors, and for the life of me I cannot think of a reason for this behavior.
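
The difference in a nutshell:

    c = {1, 'two', [3 3 3]};
    v = c{2};      % braces: the stored value itself, here the char array 'two'
    s = c(2);      % parens: a 1x1 cell array that still wraps that value
    class(v)       % char
    class(s)       % cell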

Use of line breaks

In MATLAB, a line can end in one of three characters: a newline, a semicolon (;), or a comma (,). As we all know, the semicolon suppresses program output, while the newline character does not. The comma ends the logical line, but does not suppress program output; this is a relatively little-known feature, so I thought it would be useful to share it. Except, the meaning of ; and , changes in literals (like [1, 2; 3, 4] or {'a', 'b'; 3, 4}). Here, commas separate values on the same row and are optional, and semicolons end the current row. Interestingly, literals also change the meaning of the newline character: inside a literal, a newline acts just like a semicolon, overrides a preceding comma, and you don't have to use ellipsis (...) for line continuations.
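
A few of these rules in action:

    a = 1 + 1;   % semicolon: ends the statement and suppresses output
    b = 2 + 2    % bare newline: ends the statement, prints b
    c = 3 + 3,   % comma: ends the statement, still prints c
    m = [1, 2; 3, 4];   % inside a literal, commas separate columns, semicolons end rows
    n = [1 2
         3 4];          % a newline inside brackets ends the row as well, no ... required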

Syntax rules for commands

Commands are function calls without the parentheses, like help disp, which is syntactically equivalent to help('disp'). You see, if you just specify a function name (it can't be a compound expression or a function handle) and don't use parentheses, all following words will be interpreted as strings and passed to the function. This is actually kind of a neat feature. However, how do you differentiate between variable_name + 5 and help + 5? The answer is: commands are actually a bit more complex. A command starts with a function name, followed by a space, which is not followed by an operator and a space. Thus, help +5 + 4 is a command, while help + 5 + 4 is an addition. Tricky!
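
A few examples of where command syntax helps and where it bites:

    help disp        % command syntax: the word disp is passed to help as a string
    help('disp')     % the equivalent function-call syntax
    format compact   % commands are convenient for option-setting functions
    x = 42;
    disp x           % careful: prints the literal character 'x', not the value 42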

The more-than-one-value value

If you want to save more than one value in a variable, you can use a collection (structs, matrices, maps, tables, cell arrays). In addition though, MATLAB knows another way of handling more than one value at once: the thing you get when you index a cell array with {:} or call a function with more than one result (the documentation calls it a comma-separated list). In that case, you get something that is assignable to several variables, but that is not itself a collection. Just another quirk of MATLAB's indexing logic. However, you can capture these values into matrices or cell arrays using brackets or braces, like this: {x{:}} or [x{:}]. Note that this also works in assignments in a confusing way: [z{:}] = x{:} (if both x and z have the same length). Incidentally, this is often a neat way of converting between different kinds of collections (but utterly unreadable, because type information is hopelessly lost).
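
Some of these conversions in one place:

    c = {1, 2, 3};
    [a, b, d] = c{:};   % c{:} expands to a comma-separated list of three values
    m = [c{:}];         % gather that list into a matrix: [1 2 3]
    c2 = {c{:}};        % or back into a (new) cell array
    out = cell(1, 2);
    [out{:}] = sort([3 1 2]);   % the same trick captures multiple return values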
