Basti's Scratchpad on the Internet
13 Mar 2016

The Style of Scientific Code

What does quality code look like? One common school of thought focuses on small, descriptive functions that take few arguments. To quote from Clean Code: "The first rule of functions is that they should be small", on the order of less than ten lines. "Functions should not be large enough to hold nested structures". "The ideal number of arguments for a function is zero [, one, or two]. Three [or more] arguments should be avoided where possible".

A few years ago, when I was working mostly on user interaction and data management, all of this made sense to me. What is the overhead of a few function calls and class lookups here and there if it makes the code more readable? In other words: Readability counts, and is usually more important than performance.

But lately, I have come to struggle with these rules. I am now writing a lot of scientific code, where algorithms are intrinsically complex beyond the syntactic complexity of the code. How do you "Express yourself in code [instead of comments]", when that code only consists of linear algebra and matrix multiplications?

def rectwin_spectrum(angular_frequency, specsize):
    """The spectrum of a rectangular window. [...]"""
    # In case of angular_frequency == 0, this will calculate NaN. Since
    # this will be corrected later, suppress the warning.
    with np.errstate(invalid='ignore'):
        spectrum = ( np.exp(-1j*angular_frequency*(specsize-1)/2) *
                     np.sin(specsize*angular_frequency/2) /
                     np.sin(angular_frequency/2) )
    # since sin(x) == x for small x, the above expression
    # evaluates to specsize for angular_frequency == 0.
    spectrum[angular_frequency == 0.0] = specsize
    return spectrum

A lot of my scientific code ends up quite compact like that. Maybe a hundred lines of dense numeric expressions, plus a few hundred lines of explanations and documentation. The point is, scientific code often does not decompose into easily understood extractable functions.

On a related issue, how do you avoid long argument lists in heavily parametrized equations? As Clean Code states, "when a function seems to need more than two or three arguments, it is likely that some of those arguments ought to be wrapped in a class of their own". However, in Matlab in particular, it is quite unusual to create small one-trick classes to encapsulate a few function arguments:

classdef SignalBlocks < handle
    properties
        data
        samplerate
        blocksize
        hopsize
    end
    properties (Dependent)
        duration
    end
    methods
        function obj = SignalBlock(data, samplerate, blocksize, hopsize)
            % blocksize and hopsize are optional. What a mess.
            narginchk(2, 4);
            obj.data = data;
            obj.samplerate = samplerate;
            if nargin >= 3
                obj.blocksize = blocksize;
            else
                obj.blocksize = 2048;
            end
            if nargin == 4
                obj.hopsize = hopsize;
            else
                obj.hopsize = 1024;
            end
        end
        function time = get.duration(obj)
            time = length(obj.data)/obj.samplerate;
        end
    end
end

This is not just cumbersome to write and maintain, it is also slower than passing data, samplerate, blocksize, and hopsize to each function call individually (although the overhead has gotten considerably smaller in newer versions of Matlab). Additionally, there is often a large performance benefit of not extracting every function and not keeping intermediate values in variables. Thus, it's not just readability that is hard to maintain in scientific code. Performance is a problem, too.

The sad thing is, I don't know the answer to these questions. There have been a lot of discussions about coding style and code quality in our department lately, with the clear objective to clean up our code. But common code quality criteria don't seem to apply to scientific code all that well.

Do you have any idea how to progress from here?

23 Jan 2016

Toren

I've been playing a lot of indie games lately. One of them has not been talked about much: Toren. Toren is a platformer about a girl that has to climb a tower to defeat a dragon and revive her world. This is probably the least polished game I have played in a long time. Animations are janky, controls are imprecise and clunky, and there are loads of little glitches. Yet, I really enjoyed this.

There is something about this world that feels honest to me: As you climb the tower, the child grows from a toddler to an adolescent, and is gradually introduced to more and more mature concepts. I didn't understand much of the iconography of this game, but it felt oddly cathartic to climb this tower of life, and overcome it's challenges.

I particularly liked how death played such an integral role in this story and some of the puzzles. The tower is a monument to a dead people, and yet the story and game mechanics are as much about dying as they are about rebirth and not giving up. This is underlined by the wonderful art style of this game, which contrasts vivid colors with brooding, dark architecture.

At just about two hours, Toren is not a long game. Instead of exploring one particular game mechanic, it mixes it's game up every few minutes. Every sequence looks different and beautiful, and yet it manages to tell a cohesive and effective story. ★★★★☆

06 Jan 2016

Books of 2015

Among Others, by Jo Walton

Among_Others_(Jo_Walton_novel).jpg I don't usually enjoy fantasy novels and their romantic escapism. I much prefer fascinating sci-fi thought experiments. But this book won all the most important awards, so I gave it a shot. What if random chance could be bent a little with creativity, the power of believing in something, and some mysticism? You end up with a world that is richer, more meaningful, and altogether more alive, if you just cared to observe and to appreciate it's beauty. Reading this book left me enchanted and more observant long after I put it down. What a wonderful book!

A Son of the Circus, by John Irving

ASonOfTheCircus.JPG This is one of those books that was on my to-read list for months. It starts out as quirky and likeable as you would expect from John Irving. This time, we follow the tale of a Canadian/Indian doctor throughout his life, and his summer vacation in India. But this would not be John Irving if there weren't plenty of colourful characters, astute observations of human strangeness, and a meticulously crafted story. There is no scene in this book that does not serve a purpose, and so many moving parts my mind just boggles at the construction of it all. Yet at the same time, I was regularly laughing out loud. I loved every minute of this!

The Windup Girl, by Paolo Bacigalupi

Wind_up.jpg The whole world changed in the near future, when gasoline is a rare luxury, sea levels have risen and swallowed all the coastal cities, and man-made scourges have devastated most crops. And now it's not just humans that populate our urbanized world, but so too are our inventions, artificial humans called "windups" for their stutter-stop movements. But at the core, both humans and windups struggle for the same security, prosperity as ever. Such an inventive world, so much vivid creativity, social commentary, in this human struggle to not destroy ourselves.

Zen and the art of motorcycle maintenance, by Robert Pirsig

Zen_motorcycle.jpg This is about equal parts a motorcycle journey of a father and his son across the US, and a dive into another man's discoveries of philosophy. To be honest, I liked this book more for it's character descriptions and travelling adventures than it's philosophy. I am really conflicted about putting this book on this list at all, but I kept thinking about this long after I finished reading it, so I guess this had a bigger influence on me than I realized.

Nexus/Crux/Apex, Rad/Blue/Green Mars, The Martian, The Three-Body Problem

What happens when you take today's world, and add nanotech brain upgrades (Nexus/Crux/Apex), or strand a lone scientist on Mars (The Martian), or send a large number of people to found a new colony on Mars (Red/Blue/Green Mars), or suddenly make contact with Aliens (The Three-Body Problem)? This what-if is what Science Fiction does best: Take this little what-if, and spin a gripping yarn from that. These books inspired me, made me think, were incredibly thrilling, but they did not have a lasting impact. Still well worth a read if you like Science Fiction, though.

03 Nov 2015

Calling Matlab from Python

For my latest experiments, I needed to run both Python functions and Matlab functions as part of the same program. As I noted earlier, Matlab includes the Matlab Engine for Python (MEfP), which can call Matlab functions from Python. Before I knew about this, I created Transplant, which does the very same thing. So, how do they compare?

Usage

As it's name suggests, Matlab is a matrix laboratory, and matrices are the most important data type in Matlab. Since matrices don't exist in plain Python, the MEfP implements it's own as matlab.double et al., and you have to convert any data you want to pass to Matlab into one of those. In contrast, Transplant recognizes the fact that Python does in fact know a really good matrix engine called Numpy, and just uses that instead.

       Matlab Engine for Python        |              Transplant
---------------------------------------|---------------------------------------
import numpy                           | import numpy
import matlab                          | import transplant
import matlab.engine                   |
                                       |
eng = matlab.engine.start_matlab()     | eng = transplant.Matlab()
numpy_data = numpy.random.randn(100)   | numpy_data = numpy.random.randn(100)
list_data = numpy_data.tolist()        |
matlab_data = matlab.double(list_data) |
data_sum = eng.sum(matlab_data)        | data_sum = eng.sum(numpy_data)

Aside from this difference, both libraries work almost identical. Even the handling of the number of output arguments is (accidentally) almost the same:

       Matlab Engine for Python        |              Transplant
---------------------------------------|---------------------------------------
eng.max(matlab_data)                   | eng.max(numpy_data)
>>> 4.533                              | >>> [4.533 537635]
eng.max(matlab_data, nargout=1)        | eng.max(numpy_data, nargout=1)
>>> 4.533                              | >>> 4.533
eng.max(matlab_data, nargout=2)        | eng.max(numpy_data, nargout=2)
>>> (4.533, 537635.0)                  | >>> [4.533 537635]

Similarly, both libraries can interact with Matlab objects in Python, although the MEfP can't access object properties:

       Matlab Engine for Python        |              Transplant
---------------------------------------|---------------------------------------
f = eng.figure()                       | f = eng.figure()
eng.get(f, 'Position')                 | eng.get(f, 'Position')
>>> matlab.double([[ ... ]])           | >>> array([[ ... ]])
f.Position                             | f.Position
>>> AttributeError                     | >>> array([[ ... ]])

There are a few small differences, though:

  • Function documentation in the MEfP is only available as eng.help('funcname'). Transplant will populate a function's __doc__, and thus documentation tools like IPython's ? operator just work.
  • Transplant converts empty matrices to None, whereas the MEfP represents them as matlab.double([]).
  • Transplant represents dict as containers.Map, while the MEfP uses struct (the former is more correct, the latter arguable more useful).
  • If the MEfP does not know nargout, it assumes nargout=1. Transplant uses nargout(func) or returns whatever the function writes into ans.
  • The MEfP can't return non-scalar structs, such as the return value of whos. Transplant can do this.
  • The MEfP can't return anonymous functions, such as eng.eval('@(x, y) x>y'). Transplant can do this.

Performance

The time to start a Matlab instance is shorter in MEfP (3.8 s) than in Transplant (6.1 s). But since you're doing this relatively seldomly, the difference typically doesn't matter too much.

More interesting is the time it takes to call a Matlab function from Python. Have a look:

execution%20time.png

This is running sum(randn(n,1)) from Transplant, the MEfP, and in Matlab itself. As you can see, the MEfP is a constant factor of about 1000 slower than Matlab. Transplant is a constant factor of about 100 slower than Matlab, but always takes at least 0.05 s.

There is a gap of about a factor of 10 between Transplant and the MEfP. In practice, this gap is highly significant! In my particular use case, I have a function that takes about one second of computation time for an audio signal of ten seconds (half a million values). When I call this function with Transplant, it takes about 1.3 seconds. With MEfP, it takes 4.5 seconds.

Transplant spends its time serializing the arguments to JSON, sending that JSON over ZeroMQ to Matlab, and parsing the JSON there. Well, to be honest, only the parsing part takes any significant time, overall. While it might seem onerous to serialize everything to JSON, this architecture allows Transplant to run over a network connection.

It is a bit baffling to me that MEfP manages to be slower than that, despite being written in C. Looking at the number of function calls in the profiler, the MEfP calls 25 functions (!) on each value (!!) of the input data. This is a shockingly inefficient way of doing things.

TL;DR

It used to be very difficult to work in a mixed-language environment, particularly with one of those languages being Matlab. Nowadays, this has thankfully gotten much easier. Even Mathworks themselves have stepped up their game, and can interact with Python, C, Java, and FORTRAN. But their interface to Python does leave something to be desired, and there are better alternatives available.

If you want to try Transplant, just head over to Github and use it. If you find any bugs, feature requests, or improvements, please let me know in the Github issues.

29 Oct 2015

Massive Memory Leak in the Matlab Engine for Python

As of Matlab 2014b, Matlab includes a Python module for calling Matlab code from Python. This is how you use it:

import numpy
import matlab
import matlab.engine

eng = matlab.engine.start_matlab()
random_data = numpy.random.randn(100)
# convert Numpy data to Matlab:
matlab_data = matlab.double(random_data.tolist())
data_sum = eng.sum(matlab_data)

You can call any Matlab function on eng, and you can access any Matlab workspace variable in eng.workspace. As you can see, the Matlab Engine is not Numpy-aware, and you have to convert all your Numpy data to Matlab double before you can call Matlab functions with it. Still, it works pretty well.

Recently, I ran a rather large experiment set, where I had a set of four functions, two in Matlab, two in Python, and called each of these functions a few thousand times with a bunch of different data to see how they performed.

While doing that I noticed that my Python processes were growing larger and larger, until they consumed all my memory and a sizeable chunk of my swap as well. I couldn't find any reason for this. None of my Python code cached anything, and the sum total of all global variables did not amount to anything substantial.

Enter Pympler, a memory analyzer for Python. Pympler is an amazing library for introspecting your program's memory. Among its many features, it can list the biggest objects in your running program:

from pympler import muppy, summary
summary.print_(summary.summarize(muppy.get_objects()))
                                      types |   # objects |   total size
=========================================== | =========== | ============
                        <class 'array.array |        1076 |      2.77 GB
                                <class 'str |       42839 |      7.65 MB
                               <class 'dict |        8604 |      5.43 MB
                      <class 'numpy.ndarray |          48 |      3.16 MB
                               <class 'code |       14113 |      1.94 MB
                               <class 'type |        1557 |      1.62 MB
                               <class 'list |        3158 |      1.38 MB
                                <class 'set |        1265 |    529.72 KB
                              <class 'tuple |        5129 |    336.98 KB
                              <class 'bytes |        2413 |    219.48 KB
                            <class 'weakref |        2654 |    207.34 KB
            <class 'collections.OrderedDict |          65 |    149.85 KB
                 <class 'wrapper_descriptor |        1676 |    130.94 KB
  <class 'traitlets.traitlets.MetaHasTraits |         107 |    123.55 KB
                  <class 'getset_descriptor |        1738 |    122.20 KB

Now that is interesting. Apparently, I was lugging around close to three gigabytes worth of bare-Python array.array. And these are clearly not Numpy arrays, since those would show up as numpy.ndarray. But I couldn't find any of these objects in my workspace.

So let's get a reference to one of these objects, and see who they belong to. This can also be done with Pympler, but I prefer the way objgraph does it:

import array
# get a list of all objects known to Python:
all_objects = muppy.get_objects()
# sort out only `array.array` instances:
all_arrays = [obj for obj in all_objects if isinstance(obj, array.array)]

import objgraph
objgraph.show_backrefs(all_arrays[0], filename='array.png')

array.png

It seems that the array.array object is part of a matlab.double instance which is not referenced from anywhere but all_objects. A memory leak.

After a bit of experimentation, I found the culprit. To illustrate, here's an example: The function leak passes some data to Matlab, and calculates a float. Since the variables are not used outside of leak, and the function does not return anything, all variables within the function should get deallocated when leak returns.

def leak():
    test_data = numpy.zeros(1024*1024)
    matlab_data = matlab.double(test_data.tolist())
    eng.sum(matlab_data)

Pympler has another great feature that can track allocations. The SummaryTracker will track and display any allocations between calls to print_diff(). This is very useful to see how much memory was used during the call to leak:

from pympler import tracker
tr = tracker.SummaryTracker()
tr.print_diff()
leak()
tr.print_diff()
                     types |   # objects |   total size
========================== | =========== | ============
       <class 'array.array |           1 |      8.00 MB
...

And there you have it. Note that this leak is not the Numpy array test_data and it is not the matlab array matlab_data. Both of these are garbage collected correctly. But the Matlab Engine for Python will leak any data you pass to a Matlab function.

This data is not referenced from anywhere within Python, and is counted as leaked by objgraph. In other words, the C code inside the Matlab Engine for Python copies all passed data into it's internal memory, but never frees that memory. Not even if you quit the Matlab Engine, or del all Python references to it. Your only option is to restart Python.

Postscriptum

I since posted a bug report on Mathworks, and received a patch that fixes the problem. Additionally, Mathworks said that the problem only occurs on Linux.

Older posts