# The Style of Scientific Code

What does quality code look like? One common school of thought focuses on small, descriptive functions that take few arguments. To quote from Clean Code: "The first rule of functions is that they should be small", on the order of less than ten lines. "Functions should not be large enough to hold nested structures". "The ideal number of arguments for a function is zero [, one, or two]. Three [or more] arguments should be avoided where possible".

A few years ago, when I was working mostly on user interaction and data management, all of this made sense to me. What is the overhead of a few function calls and class lookups here and there if it makes the code more readable? In other words: Readability counts, and is usually more important than performance.

But lately, I have come to struggle with these rules. I am now writing a lot of scientific code, where algorithms are intrinsically complex beyond the syntactic complexity of the code. How do you "Express yourself in code [instead of comments]", when that code only consists of linear algebra and matrix multiplications?

```python
def rectwin_spectrum(angular_frequency, specsize):
    """The spectrum of a rectangular window. [...]"""
    # In case of angular_frequency == 0, this will calculate NaN. Since
    # this will be corrected later, suppress the warning.
    with np.errstate(invalid='ignore'):
        spectrum = ( np.exp(-1j*angular_frequency*(specsize-1)/2) *
                     np.sin(specsize*angular_frequency/2) /
                     np.sin(angular_frequency/2) )
    # since sin(x) == x for small x, the above expression
    # evaluates to specsize for angular_frequency == 0.
    spectrum[angular_frequency == 0.0] = specsize
    return spectrum
```

A lot of my scientific code ends up quite compact like that. Maybe a hundred lines of dense numeric expressions, plus a few hundred lines of explanations and documentation. The point is, scientific code often does not decompose into easily understood extractable functions.
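To make the special-case handling above concrete, here is a self-contained sanity check of `rectwin_spectrum` (the function is repeated so the snippet runs on its own; only NumPy is assumed): the zero-frequency bin is patched to exactly `specsize`, and neighboring bins approach the same value.

```python
import numpy as np

def rectwin_spectrum(angular_frequency, specsize):
    """The spectrum of a rectangular window (as above)."""
    with np.errstate(invalid='ignore'):
        spectrum = (np.exp(-1j*angular_frequency*(specsize-1)/2) *
                    np.sin(specsize*angular_frequency/2) /
                    np.sin(angular_frequency/2))
    # sin(x) == x for small x, so the limit at zero is specsize:
    spectrum[angular_frequency == 0.0] = specsize
    return spectrum

# evaluate at zero and just next to it:
freqs = np.array([0.0, 1e-6])
spec = rectwin_spectrum(freqs, 16)
print(abs(spec[0]))  # → 16.0
```

The check confirms the comment in the code: the patched value at zero agrees with the limit of the analytic expression.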

On a related issue, how do you avoid long argument lists in heavily parametrized equations? As Clean Code states, "when a function seems to need more than two or three arguments, it is likely that some of those arguments ought to be wrapped in a class of their own". However, in Matlab in particular, it is quite unusual to create small one-trick classes to encapsulate a few function arguments:

```matlab
classdef SignalBlocks < handle
    properties
        data
        samplerate
        blocksize
        hopsize
    end
    properties (Dependent)
        duration
    end
    methods
        function obj = SignalBlocks(data, samplerate, blocksize, hopsize)
            % blocksize and hopsize are optional. What a mess.
            narginchk(2, 4);
            obj.data = data;
            obj.samplerate = samplerate;
            if nargin >= 3
                obj.blocksize = blocksize;
            else
                obj.blocksize = 2048;
            end
            if nargin == 4
                obj.hopsize = hopsize;
            else
                obj.hopsize = 1024;
            end
        end
        function time = get.duration(obj)
            time = length(obj.data)/obj.samplerate;
        end
    end
end
```
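For contrast, the same parameter bundle is nearly free in Python. This is a sketch of an equivalent (my own illustration, not code from the original experiments), where a small dataclass handles the defaults and the derived property:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SignalBlocks:
    """Bundles a signal with its blocking parameters (Python sketch)."""
    data: np.ndarray
    samplerate: int
    blocksize: int = 2048   # same defaults as the Matlab version
    hopsize: int = 1024

    @property
    def duration(self):
        """Length of the signal in seconds."""
        return len(self.data) / self.samplerate

blocks = SignalBlocks(np.zeros(44100), samplerate=44100)
print(blocks.duration)  # → 1.0
```

The whole optional-argument dance disappears, which is part of why these small value classes feel natural in Python but cumbersome in Matlab.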

This is not just cumbersome to write and maintain, it is also slower than passing `data`, `samplerate`, `blocksize`, and `hopsize` to each function call individually (although the overhead has gotten considerably smaller in newer versions of Matlab). Additionally, there is often a large performance benefit of *not* extracting every function and *not* keeping intermediate values in variables. Thus, it's not just readability that is hard to maintain in scientific code. Performance is a problem, too.
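The cost of extracting tiny functions is easy to demonstrate even in plain Python. This is a rough, hedged illustration (absolute numbers vary by machine, and the effect is smaller in compiled languages): the same computation written as one dense expression and as a chain of small "clean" functions.

```python
import timeit

def inline_version(xs):
    # one dense expression, no extracted helpers
    return sum((x * 2 + 1) ** 2 for x in xs)

def double(x):
    return x * 2

def increment(x):
    return x + 1

def square(x):
    return x ** 2

def extracted_version(xs):
    # the same computation, split into tiny single-purpose functions
    return sum(square(increment(double(x))) for x in xs)

xs = list(range(10_000))
assert inline_version(xs) == extracted_version(xs)

t_inline = timeit.timeit(lambda: inline_version(xs), number=100)
t_extracted = timeit.timeit(lambda: extracted_version(xs), number=100)
print(f"extracted/inline time ratio: {t_extracted / t_inline:.2f}")
```

In CPython each function call carries real overhead, so the extracted version typically pays a measurable penalty; in numeric inner loops, that penalty adds up.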

The sad thing is, I don't know the answer to these questions. There have been a lot of discussions about coding style and code quality in our department lately, with the clear objective to clean up our code. But common code quality criteria don't seem to apply to scientific code all that well.

Do you have any idea how to progress from here?

# Toren

I've been playing a lot of indie games lately. One of them has not been talked about much: Toren. Toren is a platformer about a girl who has to climb a tower to defeat a dragon and revive her world. This is probably the least polished game I have played in a long time. Animations are janky, controls are imprecise and clunky, and there are loads of little glitches. Yet, I really enjoyed it.

There is something about this world that feels honest to me: As you climb the tower, the child grows from a toddler to an adolescent, and is gradually introduced to more and more mature concepts. I didn't understand much of the iconography of this game, but it felt oddly cathartic to climb this tower of life, and overcome its challenges.

I particularly liked how death played such an integral role in this story and some of the puzzles. The tower is a monument to a dead people, and yet the story and game mechanics are as much about dying as they are about rebirth and not giving up. This is underlined by the wonderful art style of this game, which contrasts vivid colors with brooding, dark architecture.

At just about two hours, Toren is not a long game. Instead of exploring one particular game mechanic, it mixes its gameplay up every few minutes. Every sequence looks different and beautiful, and yet it manages to tell a cohesive and effective story. ★★★★☆

# Books of 2015

## Among Others, by Jo Walton

I don't usually enjoy fantasy novels and their romantic escapism. I much prefer fascinating sci-fi thought experiments. But this book won all the most important awards, so I gave it a shot. What if random chance could be bent a little with creativity, the power of believing in something, and some mysticism? You end up with a world that is richer, more meaningful, and altogether more alive, if you just care to observe and appreciate its beauty. Reading this book left me enchanted and more observant long after I put it down. What a wonderful book!

## A Son of the Circus, by John Irving

This is one of those books that was on my to-read list for months. It starts out as quirky and likeable as you would expect from John Irving. This time, we follow the tale of a Canadian/Indian doctor throughout his life, and his summer vacation in India. But this would not be John Irving if there weren't plenty of colourful characters, astute observations of human strangeness, and a meticulously crafted story. There is no scene in this book that does not serve a purpose, and there are so many moving parts that my mind boggles at the construction of it all. Yet at the same time, I was regularly laughing out loud. I loved every minute of this!

## The Windup Girl, by Paolo Bacigalupi

The whole world has changed in this near future: gasoline is a rare luxury, sea levels have risen and swallowed all the coastal cities, and man-made scourges have devastated most crops. And now it's not just humans who populate our urbanized world, but our inventions as well: artificial humans called "windups" for their stutter-stop movements. At their core, though, humans and windups alike struggle for the same security and prosperity as ever. Such an inventive world, with so much vivid creativity and social commentary in this human struggle to not destroy ourselves.

## Zen and the art of motorcycle maintenance, by Robert Pirsig

This is about equal parts a motorcycle journey of a father and his son across the US, and a dive into another man's discoveries of philosophy. To be honest, I liked this book more for its character descriptions and travelling adventures than its philosophy. I am really conflicted about putting this book on this list at all, but I kept thinking about it long after I finished reading, so I guess it had a bigger influence on me than I realized.

## Nexus/Crux/Apex, Red/Blue/Green Mars, The Martian, The Three-Body Problem

What happens when you take today's world, and add nanotech brain upgrades (Nexus/Crux/Apex), or strand a lone scientist on Mars (The Martian), or send a large number of people to found a new colony on Mars (Red/Blue/Green Mars), or suddenly make contact with aliens (The Three-Body Problem)? This is what Science Fiction does best: take a little what-if, and spin a gripping yarn from it. These books inspired me, made me think, and were incredibly thrilling, but they did not have a lasting impact. Still well worth a read if you like Science Fiction, though.

# Calling Matlab from Python

For my latest experiments, I needed to run both Python functions and Matlab functions as part of the same program. As I noted earlier, Matlab includes the Matlab Engine for Python (MEfP), which can call Matlab functions from Python. Before I knew about this, I created Transplant, which does the very same thing. So, how do they compare?

## Usage

As its name suggests, Matlab is a **mat**rix **lab**oratory, and matrices are the most important data type in Matlab. Since matrices don't exist in plain Python, the MEfP implements its own as `matlab.double` et al., and you have to convert any data you want to pass to Matlab into one of those. In contrast, Transplant recognizes that Python already has a really good matrix engine called Numpy, and simply uses that instead.

```
Matlab Engine for Python                | Transplant
----------------------------------------|---------------------------------------
import numpy                            | import numpy
import matlab                           | import transplant
import matlab.engine                    |
                                        |
eng = matlab.engine.start_matlab()      | eng = transplant.Matlab()
numpy_data = numpy.random.randn(100)    | numpy_data = numpy.random.randn(100)
list_data = numpy_data.tolist()         |
matlab_data = matlab.double(list_data)  |
data_sum = eng.sum(matlab_data)         | data_sum = eng.sum(numpy_data)
```

Aside from this difference, both libraries work almost identically. Even the handling of the number of output arguments is (accidentally) almost the same:

```
Matlab Engine for Python                | Transplant
----------------------------------------|---------------------------------------
eng.max(matlab_data)                    | eng.max(numpy_data)
>>> 4.533                               | >>> [4.533 537635]
eng.max(matlab_data, nargout=1)         | eng.max(numpy_data, nargout=1)
>>> 4.533                               | >>> 4.533
eng.max(matlab_data, nargout=2)         | eng.max(numpy_data, nargout=2)
>>> (4.533, 537635.0)                   | >>> [4.533 537635]
```

Similarly, both libraries can interact with Matlab objects in Python, although the MEfP can't access object properties:

```
Matlab Engine for Python                | Transplant
----------------------------------------|---------------------------------------
f = eng.figure()                        | f = eng.figure()
eng.get(f, 'Position')                  | eng.get(f, 'Position')
>>> matlab.double([[ ... ]])            | >>> array([[ ... ]])
f.Position                              | f.Position
>>> AttributeError                      | >>> array([[ ... ]])
```

There are a few small differences, though:

- Function documentation in the MEfP is only available as `eng.help('funcname')`. Transplant will populate a function's `__doc__`, and thus documentation tools like IPython's `?` operator just work.
- Transplant converts empty matrices to `None`, whereas the MEfP represents them as `matlab.double([])`.
- Transplant represents `dict` as `containers.Map`, while the MEfP uses `struct` (the former is more correct, the latter arguably more useful).
- If the MEfP does not know `nargout`, it assumes `nargout=1`. Transplant uses `nargout(func)` or returns whatever the function writes into `ans`.
- The MEfP can't return non-scalar structs, such as the return value of `whos`. Transplant can do this.
- The MEfP can't return anonymous functions, such as `eng.eval('@(x, y) x>y')`. Transplant can do this.

## Performance

The time to start a Matlab instance is shorter in the MEfP (3.8 s) than in Transplant (6.1 s). But since you do this relatively seldom, the difference typically doesn't matter much.

More interesting is the time it takes to call a Matlab function from Python. Have a look:

This is running `sum(randn(n,1))` from Transplant, the MEfP, and in Matlab itself. As you can see, the MEfP is a constant factor of about 1000 slower than Matlab. Transplant is a constant factor of about 100 slower than Matlab, but always takes at least 0.05 s.

There is a gap of about a factor of 10 between Transplant and the MEfP. In practice, this gap is highly significant! In my particular use case, I have a function that takes about one second of computation time for an audio signal of ten seconds (half a million values). When I call this function with Transplant, it takes about 1.3 seconds. With MEfP, it takes 4.5 seconds.

Transplant spends its time serializing the arguments to JSON, sending that JSON over ZeroMQ to Matlab, and parsing the JSON there. Well, to be honest, only the parsing part takes any significant time, overall. While it might seem onerous to serialize everything to JSON, this architecture allows Transplant to run over a network connection.
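To give a flavor of the serialization step (a simplified sketch of the idea only, not Transplant's actual wire format), a matrix can be packed into JSON by base64-encoding its raw bytes along with its shape and dtype:

```python
import base64
import json
import numpy as np

def encode_matrix(matrix):
    """Pack a Numpy array into a JSON-serializable dict (illustrative only)."""
    return {
        "dtype": str(matrix.dtype),
        "shape": matrix.shape,
        "data": base64.b64encode(matrix.tobytes()).decode('ascii'),
    }

def decode_matrix(obj):
    """Reverse of encode_matrix: reconstruct the Numpy array."""
    raw = base64.b64decode(obj["data"])
    return np.frombuffer(raw, dtype=obj["dtype"]).reshape(obj["shape"])

# a hypothetical call message, round-tripped through JSON:
matrix = np.random.randn(4, 3)
message = json.dumps({"type": "call", "name": "sum",
                      "args": [encode_matrix(matrix)]})
decoded = decode_matrix(json.loads(message)["args"][0])
assert np.array_equal(matrix, decoded)
```

Because the payload is plain JSON, such messages can travel over any transport, which is what makes the ZeroMQ-over-network setup possible.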

It is a bit baffling to me that MEfP manages to be slower than *that*, despite being written in C. Looking at the number of function calls in the profiler, the MEfP calls 25 functions (!) on each value (!!) of the input data. This is a shockingly inefficient way of doing things.

## TL;DR

It used to be very difficult to work in a mixed-language environment, particularly with one of those languages being Matlab. Nowadays, this has thankfully gotten much easier. Even Mathworks themselves have stepped up their game, and can interact with Python, C, Java, and FORTRAN. But their interface to Python does leave something to be desired, and there are better alternatives available.

If you want to try Transplant, just head over to Github and use it. If you find any bugs, feature requests, or improvements, please let me know in the Github issues.

# Massive Memory Leak in the Matlab Engine for Python

As of Matlab 2014b, Matlab includes a Python module for calling Matlab code from Python. This is how you use it:

```python
import numpy
import matlab
import matlab.engine

eng = matlab.engine.start_matlab()
random_data = numpy.random.randn(100)
# convert Numpy data to Matlab:
matlab_data = matlab.double(random_data.tolist())
data_sum = eng.sum(matlab_data)
```

You can call any Matlab function on `eng`, and you can access any Matlab workspace variable in `eng.workspace`. As you can see, the Matlab Engine is not Numpy-aware, and you have to convert all your Numpy data to Matlab `double` before you can call Matlab functions with it. Still, it works pretty well.

Recently, I ran a rather large experiment set, where I had a set of four functions, two in Matlab, two in Python, and called each of these functions a few thousand times with a bunch of different data to see how they performed.

While doing that I noticed that my Python processes were growing larger and larger, until they consumed all my memory and a sizeable chunk of my swap as well. I couldn't find any reason for this. None of my Python code cached anything, and the sum total of all global variables did not amount to anything substantial.

Enter Pympler, a memory analyzer for Python. Pympler is an amazing library for introspecting your program's memory. Among its many features, it can list the biggest objects in your running program:

```python
from pympler import muppy, summary
summary.print_(summary.summarize(muppy.get_objects()))
```

```
                                      types |   # objects |   total size
=========================================== | =========== | ============
                       <class 'array.array  |        1076 |      2.77 GB
                               <class 'str  |       42839 |      7.65 MB
                              <class 'dict  |        8604 |      5.43 MB
                     <class 'numpy.ndarray  |          48 |      3.16 MB
                              <class 'code  |       14113 |      1.94 MB
                              <class 'type  |        1557 |      1.62 MB
                              <class 'list  |        3158 |      1.38 MB
                               <class 'set  |        1265 |    529.72 KB
                             <class 'tuple  |        5129 |    336.98 KB
                             <class 'bytes  |        2413 |    219.48 KB
                           <class 'weakref  |        2654 |    207.34 KB
           <class 'collections.OrderedDict  |          65 |    149.85 KB
                <class 'wrapper_descriptor  |        1676 |    130.94 KB
 <class 'traitlets.traitlets.MetaHasTraits  |         107 |    123.55 KB
                 <class 'getset_descriptor  |        1738 |    122.20 KB
```

Now that is interesting. Apparently, I was lugging around close to three gigabytes worth of bare-Python `array.array`. And these are clearly not Numpy arrays, since those would show up as `numpy.ndarray`. But I couldn't find any of these objects in my workspace.

So let's get a reference to one of these objects, and see who they belong to. This can also be done with Pympler, but I prefer the way objgraph does it:

```python
import array

# get a list of all objects known to Python:
all_objects = muppy.get_objects()
# sort out only `array.array` instances:
all_arrays = [obj for obj in all_objects if isinstance(obj, array.array)]

import objgraph
objgraph.show_backrefs(all_arrays[0], filename='array.png')
```

It seems that the `array.array` object is part of a `matlab.double` instance which is not referenced from anywhere but `all_objects`. A memory leak.

After a bit of experimentation, I found the culprit. To illustrate, here's an example: The function `leak` passes some data to Matlab, and calculates a float. Since the variables are not used outside of `leak`, and the function does not return anything, all variables within the function should get deallocated when `leak` returns.

```python
def leak():
    test_data = numpy.zeros(1024*1024)
    matlab_data = matlab.double(test_data.tolist())
    eng.sum(matlab_data)
```

Pympler has another great feature that can track allocations. The `SummaryTracker` will track and display any allocations between calls to `print_diff()`. This is very useful to see how much memory was used during the call to `leak`:

```python
from pympler import tracker
tr = tracker.SummaryTracker()
tr.print_diff()
leak()
tr.print_diff()
```

```
                     types |   # objects |   total size
========================== | =========== | ============
      <class 'array.array  |           1 |      8.00 MB
...
```

And there you have it. Note that this leak is not the Numpy array `test_data`, and it is not the Matlab array `matlab_data`. Both of these are garbage collected correctly. But **the Matlab Engine for Python will leak any data you pass to a Matlab function**.

This data is not referenced from anywhere within Python, and is counted as *leaked* by `objgraph`. In other words, the C code inside the Matlab Engine for Python copies all passed data into its internal memory, but never frees that memory. Not even if you quit the Matlab Engine, or `del` all Python references to it. Your only option is to restart Python.
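Until a fix lands, one generic mitigation (my own suggestion, not something from Mathworks) is to confine the leaky calls to a throwaway worker process, so that whatever memory leaks is reclaimed by the OS when the worker exits. The `leaky_matlab_call` below is a runnable stand-in for a real Matlab Engine call:

```python
import multiprocessing

def leaky_matlab_call(values):
    # stand-in for a function that starts the Matlab Engine,
    # passes data to it, and thereby leaks memory
    return sum(values)

def call_in_worker(func, *args):
    """Run func(*args) in a single-use worker process; any memory
    it leaks is returned to the OS when the process exits."""
    with multiprocessing.Pool(processes=1) as pool:
        return pool.apply(func, args)

if __name__ == '__main__':
    result = call_in_worker(leaky_matlab_call, [1.0, 2.0, 3.0])
    print(result)  # → 6.0
```

The trade-off is that each worker would have to start its own Matlab instance (several seconds, as measured above), so this only pays off for long-running batch jobs where the leak would otherwise exhaust memory.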

**Postscriptum**

I have since filed a bug report with Mathworks, and received a patch that fixes the problem. Additionally, Mathworks said that the problem only occurs on Linux.