03 Nov 2015

Calling Matlab from Python

For my latest experiments, I needed to run both Python functions and Matlab functions as part of the same program. As I noted earlier, Matlab includes the Matlab Engine for Python (MEfP), which can call Matlab functions from Python. Before I knew about this, I created Transplant, which does the very same thing. So, how do they compare?

Usage

As it's name suggests, Matlab is a matrix laboratory, and matrices are the most important data type in Matlab. Since matrices don't exist in plain Python, the MEfP implements it's own as matlab.double et al., and you have to convert any data you want to pass to Matlab into one of those. In contrast, Transplant recognizes the fact that Python does in fact know a really good matrix engine called Numpy, and just uses that instead.

       Matlab Engine for Python        |              Transplant
---------------------------------------|---------------------------------------
import numpy                           | import numpy
import matlab                          | import transplant
import matlab.engine                   |
|
eng = matlab.engine.start_matlab()     | eng = transplant.Matlab()
numpy_data = numpy.random.randn(100)   | numpy_data = numpy.random.randn(100)
list_data = numpy_data.tolist()        |
matlab_data = matlab.double(list_data) |
data_sum = eng.sum(matlab_data)        | data_sum = eng.sum(numpy_data)


Aside from this difference, both libraries work almost identical. Even the handling of the number of output arguments is (accidentally) almost the same:

       Matlab Engine for Python        |              Transplant
---------------------------------------|---------------------------------------
eng.max(matlab_data)                   | eng.max(numpy_data)
>>> 4.533                              | >>> [4.533 537635]
eng.max(matlab_data, nargout=1)        | eng.max(numpy_data, nargout=1)
>>> 4.533                              | >>> 4.533
eng.max(matlab_data, nargout=2)        | eng.max(numpy_data, nargout=2)
>>> (4.533, 537635.0)                  | >>> [4.533 537635]


Similarly, both libraries can interact with Matlab objects in Python, although the MEfP can't access object properties:

       Matlab Engine for Python        |              Transplant
---------------------------------------|---------------------------------------
f = eng.figure()                       | f = eng.figure()
eng.get(f, 'Position')                 | eng.get(f, 'Position')
>>> matlab.double([[ ... ]])           | >>> array([[ ... ]])
f.Position                             | f.Position
>>> AttributeError                     | >>> array([[ ... ]])


There are a few small differences, though:

• Function documentation in the MEfP is only available as eng.help('funcname'). Transplant will populate a function's __doc__, and thus documentation tools like IPython's ? operator just work.
• Transplant converts empty matrices to None, whereas the MEfP represents them as matlab.double([]).
• Transplant represents dict as containers.Map, while the MEfP uses struct (the former is more correct, the latter arguable more useful).
• If the MEfP does not know nargout, it assumes nargout=1. Transplant uses nargout(func) or returns whatever the function writes into ans.
• The MEfP can't return non-scalar structs, such as the return value of whos. Transplant can do this.
• The MEfP can't return anonymous functions, such as eng.eval('@(x, y) x>y'). Transplant can do this.

Performance

The time to start a Matlab instance is shorter in MEfP (3.8 s) than in Transplant (6.1 s). But since you're doing this relatively seldomly, the difference typically doesn't matter too much.

More interesting is the time it takes to call a Matlab function from Python. Have a look:

This is running sum(randn(n,1)) from Transplant, the MEfP, and in Matlab itself. As you can see, the MEfP is a constant factor of about 1000 slower than Matlab. Transplant is a constant factor of about 100 slower than Matlab, but always takes at least 0.05 s.

There is a gap of about a factor of 10 between Transplant and the MEfP. In practice, this gap is highly significant! In my particular use case, I have a function that takes about one second of computation time for an audio signal of ten seconds (half a million values). When I call this function with Transplant, it takes about 1.3 seconds. With MEfP, it takes 4.5 seconds.

Transplant spends its time serializing the arguments to JSON, sending that JSON over ZeroMQ to Matlab, and parsing the JSON there. Well, to be honest, only the parsing part takes any significant time, overall. While it might seem onerous to serialize everything to JSON, this architecture allows Transplant to run over a network connection.

It is a bit baffling to me that MEfP manages to be slower than that, despite being written in C. Looking at the number of function calls in the profiler, the MEfP calls 25 functions (!) on each value (!!) of the input data. This is a shockingly inefficient way of doing things.

TL;DR

It used to be very difficult to work in a mixed-language environment, particularly with one of those languages being Matlab. Nowadays, this has thankfully gotten much easier. Even Mathworks themselves have stepped up their game, and can interact with Python, C, Java, and FORTRAN. But their interface to Python does leave something to be desired, and there are better alternatives available.

If you want to try Transplant, just head over to Github and use it. If you find any bugs, feature requests, or improvements, please let me know in the Github issues.

29 Okt 2015

Massive Memory Leak in the Matlab Engine for Python

As of Matlab 2014b, Matlab includes a Python module for calling Matlab code from Python. This is how you use it:

import numpy
import matlab
import matlab.engine

eng = matlab.engine.start_matlab()
random_data = numpy.random.randn(100)
# convert Numpy data to Matlab:
matlab_data = matlab.double(random_data.tolist())
data_sum = eng.sum(matlab_data)


You can call any Matlab function on eng, and you can access any Matlab workspace variable in eng.workspace. As you can see, the Matlab Engine is not Numpy-aware, and you have to convert all your Numpy data to Matlab double before you can call Matlab functions with it. Still, it works pretty well.

Recently, I ran a rather large experiment set, where I had a set of four functions, two in Matlab, two in Python, and called each of these functions a few thousand times with a bunch of different data to see how they performed.

While doing that I noticed that my Python processes were growing larger and larger, until they consumed all my memory and a sizeable chunk of my swap as well. I couldn't find any reason for this. None of my Python code cached anything, and the sum total of all global variables did not amount to anything substantial.

Enter Pympler, a memory analyzer for Python. Pympler is an amazing library for introspecting your program's memory. Among its many features, it can list the biggest objects in your running program:

from pympler import muppy, summary
summary.print_(summary.summarize(muppy.get_objects()))

                                      types |   # objects |   total size
=========================================== | =========== | ============
<class 'array.array |        1076 |      2.77 GB
<class 'str |       42839 |      7.65 MB
<class 'dict |        8604 |      5.43 MB
<class 'numpy.ndarray |          48 |      3.16 MB
<class 'code |       14113 |      1.94 MB
<class 'type |        1557 |      1.62 MB
<class 'list |        3158 |      1.38 MB
<class 'set |        1265 |    529.72 KB
<class 'tuple |        5129 |    336.98 KB
<class 'bytes |        2413 |    219.48 KB
<class 'weakref |        2654 |    207.34 KB
<class 'collections.OrderedDict |          65 |    149.85 KB
<class 'wrapper_descriptor |        1676 |    130.94 KB
<class 'traitlets.traitlets.MetaHasTraits |         107 |    123.55 KB
<class 'getset_descriptor |        1738 |    122.20 KB


Now that is interesting. Apparently, I was lugging around close to three gigabytes worth of bare-Python array.array. And these are clearly not Numpy arrays, since those would show up as numpy.ndarray. But I couldn't find any of these objects in my workspace.

So let's get a reference to one of these objects, and see who they belong to. This can also be done with Pympler, but I prefer the way objgraph does it:

import array
# get a list of all objects known to Python:
all_objects = muppy.get_objects()
# sort out only array.array instances:
all_arrays = [obj for obj in all_objects if isinstance(obj, array.array)]

import objgraph
objgraph.show_backrefs(all_arrays[0], filename='array.png')


It seems that the array.array object is part of a matlab.double instance which is not referenced from anywhere but all_objects. A memory leak.

After a bit of experimentation, I found the culprit. To illustrate, here's an example: The function leak passes some data to Matlab, and calculates a float. Since the variables are not used outside of leak, and the function does not return anything, all variables within the function should get deallocated when leak returns.

def leak():
test_data = numpy.zeros(1024*1024)
matlab_data = matlab.double(test_data.tolist())
eng.sum(matlab_data)


Pympler has another great feature that can track allocations. The SummaryTracker will track and display any allocations between calls to print_diff(). This is very useful to see how much memory was used during the call to leak:

from pympler import tracker
tr = tracker.SummaryTracker()
tr.print_diff()
leak()
tr.print_diff()

                     types |   # objects |   total size
========================== | =========== | ============
<class 'array.array |           1 |      8.00 MB
...


And there you have it. Note that this leak is not the Numpy array test_data and it is not the matlab array matlab_data. Both of these are garbage collected correctly. But the Matlab Engine for Python will leak any data you pass to a Matlab function.

This data is not referenced from anywhere within Python, and is counted as leaked by objgraph. In other words, the C code inside the Matlab Engine for Python copies all passed data into it's internal memory, but never frees that memory. Not even if you quit the Matlab Engine, or del all Python references to it. Your only option is to restart Python.

Postscriptum

I since posted a bug report on Mathworks, and received a patch that fixes the problem. Additionally, Mathworks said that the problem only occurs on Linux.

16 Okt 2015

OS X Finder Woes

The Mac. It used to be the most streamlined, thought-through general computing device on the market.

Even it's file management used to be top-notch. There were many cool little touches. One particularly useful feature was the Proxy Icon–if a window displayed a file's content, that file's icon would show up in the window's title. And you could drag that icon directly onto a thumb drive or email, without having to use the Finder. But the Finder, too, had many neat little features. I loved the fact that when you renamed a file in an alphabetically sorted file list, Finder would not immediately re-shuffle it to its new location, but would wait half a second before doing so. When renaming multiple files, this was really useful, since you could go through them one by one and rename them, simply by pressing arrow keys and return.

But as you might have guessed from my use of the past tense, these golden days are gone. The Finder used to know a JPEG from a ZIP regardless of file extension. Now it doesn't any more. The Proxy Icon is still draggable, but it will create an alias instead of a copy–perfectly useless on a thumb drive or in an email.

And with the newest version of OS X, El Capitan, they finally blew it for me. Before, even though the Finder inexplicably never had the ability to cut and paste files, you could always install programs like TotalFinder to fix that. Not so with El Capitan. The Finder now is holy land, and can not be touched any more by third parties. So no more cut and paste, no more un-hiding system files. No more side-by-side Finder tabs. And brand new with El Capitan as well: No more waiting after renaming. Now, when you rename a file, it is immediately re-sorted to its new position, thus making renaming multiple files terribly inconvenient.

So, good bye OS X. I updated my work laptop first, and I regret it. I never regretted an OS X update before. My home machine is not going to get the update. It is honestly sad to see my once-beloved Mac platform becoming worse and worse and worse with every new release.

03 Okt 2015

Changing File Creation Dates in OSX

On my last vacation, I have taken a bunch of pictures, and a bunch of video. The problem is, I hadn't used the video camera in a long time, and it believed that all it's videos were taken on the first of January 2012. So in order for the pictures to show up correctly in my picture library, I wanted to correct that.

For images, this is relatively easy: Most picture libraries support some kind of bulk date changes, and there are a bunch of command line utilities that can do it, too. But none of these tools work for video (exiftool claims be able to do that, but I couldn't get it to work).

So instead, I went about to change the file creation date of the actual video files. And it turns out, this is surprisingly hard! The thing is, most Unix systems (a Mac is technically a Unix system) don't even know the concept of a file creation date. Thus, most Unix utilities, including most programming languages, don't know how to deal with that, either.

If you have XCode installed, this will come with SetFile, a command line utility that can change file creation dates. Note that SetFile can change either the file creation date, or the file modification date, but not both at the same time, as any normal Unix utility would. Also note that SetFile expects dates in American notation, which is about as nonsensical as date formats come.

Anyway, here's a small Python script that changes the file creation date (but not the time) of a bunch of video files:

import os.path
import os
import datetime
# I want to change the dates on the files GOPR0246.MP4-GOPR0264.MP4
for index in range(426, 465):
filename = 'GOPR0{}.MP4'.format(index)
# extract old date:
date = datetime.datetime.fromtimestamp(os.path.getctime(filename))
# create a new date with the same time, but on 2015-08-22
new_date = datetime.datetime(2015,  8, 22, date.hour, date.minute, date.second)
# set the file creation date with the "-d" switch, which presumably stands for "dodification"
os.system('SetFile -d "{}" {}'.format(new_date.strftime('%m/%d/%Y %H:%M:%S'), filename))
# set the file modification date with the "-m" switch
os.system('SetFile -m "{}" {}'.format(new_date.strftime('%m/%d/%Y %H:%M:%S'), filename))

29 Sep 2015

Numpy Broadcasting Rules

They say that all arithmetic operations in Numpy behave like their element-wise cousins in Matlab. This is wrong, and seriously tripped me up last week.

In particular, this is what happens when you multiply an array with a matrix1 in Numpy:

     [[  1],           [[1, 2, 3],       [[ 1,    2,   3],
[ 10],       *    [4, 5, 6],   =    [ 40,  50,  60],
[100]]            [7, 8, 9]]        [700, 800, 900]]

[  1,  10, 100]       [[1, 2, 3],       [[  1,  20, 300],
OR         *    [4, 5, 6],   =    [  4,  50, 600],
[[  1,  10, 100]]       [7, 8, 9]]        [  7,  80, 900]]


They behave as if each row was evaluated separately, and singular dimensions are repeated where necessary. It helps to think about them as row-wise, instead of element-wise. This is particularly important in the second example, where the whole 1d-array is multiplied with every row of the 2d-array.

Note that this is not equivalent to multiplying every element as in [a[n]*b[n] for n in range(len(a))]. I guess that's why this is called broadcasting, and not element-wise.

Footnotes:

1

"matrix" here refers to a 2-d numpy.array. There is also a numpy.matrix, where multiplication is matrix multiplication, but this is not what I'm talking about.