Basti's Scratchpad on the Internet
29 Oct 2015

Massive Memory Leak in the Matlab Engine for Python

As of Matlab R2014b, Matlab includes a Python module for calling Matlab code from Python. This is how you use it:

import numpy
import matlab
import matlab.engine

# start a new Matlab instance in the background:
eng = matlab.engine.start_matlab()
random_data = numpy.random.randn(100)
# convert Numpy data to Matlab:
matlab_data = matlab.double(random_data.tolist())
# call Matlab's sum; the scalar result comes back as a Python float:
data_sum = eng.sum(matlab_data)

You can call any Matlab function on eng, and you can access any Matlab workspace variable through eng.workspace. As you can see, the Matlab Engine is not Numpy-aware: you have to convert all your Numpy data to matlab.double before you can call Matlab functions on it. Still, it works pretty well.
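Since all this converting back and forth gets tedious quickly, it can be worth wrapping it in a pair of small helpers. A minimal sketch (to_matlab and to_numpy are hypothetical names, not part of the engine API; it relies on matlab.double behaving like a nested sequence, which numpy.array can consume, with Matlab matrices coming back two-dimensional):

import numpy
import matlab

def to_matlab(a):
    # matlab.double only accepts plain Python lists, not Numpy arrays:
    return matlab.double(a.tolist())

def to_numpy(m):
    # matlab.double is a nested sequence, so numpy.array can consume it;
    # Matlab matrices are always two-dimensional, hence the squeeze:
    return numpy.squeeze(numpy.array(m))

result = eng.sqrt(to_matlab(numpy.array([1.0, 4.0, 9.0])))
print(to_numpy(result))  # prints something like [ 1.  2.  3.]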

Recently, I ran a rather large set of experiments: four functions, two in Matlab and two in Python, each called a few thousand times with a bunch of different data to see how they performed.

While doing that, I noticed that my Python processes were growing larger and larger, until they had consumed all my memory and a sizeable chunk of my swap as well. I couldn't find any reason for this: none of my Python code cached anything, and the sum total of all global variables did not amount to anything substantial.
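To put numbers on that growth, you can watch the process' peak memory usage from within Python. A minimal sketch using the standard library's resource module (run_experiment is a hypothetical stand-in for one batch of function calls; note that ru_maxrss is reported in kilobytes on Linux):

import resource

def peak_memory_mb():
    # peak resident set size of this process, in megabytes:
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

for n in range(5):
    run_experiment()  # hypothetical: one batch of Matlab/Python calls
    print('after batch {}: {:.0f} MB'.format(n, peak_memory_mb()))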

Enter Pympler, a memory analyzer for Python. Pympler is an amazing library for introspecting your program's memory. Among its many features, it can list the biggest objects in your running program:

from pympler import muppy, summary
# summarize every object the Python interpreter currently knows about:
summary.print_(summary.summarize(muppy.get_objects()))
                                      types |   # objects |   total size
=========================================== | =========== | ============
                        <class 'array.array |        1076 |      2.77 GB
                                <class 'str |       42839 |      7.65 MB
                               <class 'dict |        8604 |      5.43 MB
                      <class 'numpy.ndarray |          48 |      3.16 MB
                               <class 'code |       14113 |      1.94 MB
                               <class 'type |        1557 |      1.62 MB
                               <class 'list |        3158 |      1.38 MB
                                <class 'set |        1265 |    529.72 KB
                              <class 'tuple |        5129 |    336.98 KB
                              <class 'bytes |        2413 |    219.48 KB
                            <class 'weakref |        2654 |    207.34 KB
            <class 'collections.OrderedDict |          65 |    149.85 KB
                 <class 'wrapper_descriptor |        1676 |    130.94 KB
  <class 'traitlets.traitlets.MetaHasTraits |         107 |    123.55 KB
                  <class 'getset_descriptor |        1738 |    122.20 KB

Now that is interesting. Apparently, I was lugging around close to three gigabytes' worth of plain Python array.array objects. These are clearly not Numpy arrays, since those would show up as numpy.ndarray. But I couldn't find any of these objects in my workspace.

So let's get a reference to one of these objects and see where it belongs. This can also be done with Pympler, but I prefer the way objgraph does it:

import array
from pympler import muppy

# get a list of all objects known to Python:
all_objects = muppy.get_objects()
# sort out only `array.array` instances:
all_arrays = [obj for obj in all_objects if isinstance(obj, array.array)]

import objgraph
objgraph.show_backrefs(all_arrays[0], filename='array.png')
[Figure: array.png, the back-reference graph of the first array.array instance]

It seems that the array.array object is part of a matlab.double instance which is not referenced from anywhere but all_objects. A memory leak.

After a bit of experimentation, I found the culprit. To illustrate, here's an example: the function leak passes some data to Matlab and has it calculate a sum, which comes back as a plain float. Since the variables are not used outside of leak, and the function does not return anything, all variables within the function should get deallocated when leak returns.

def leak():
    # eight megabytes of zeros:
    test_data = numpy.zeros(1024*1024)
    matlab_data = matlab.double(test_data.tolist())
    # the result is discarded; nothing should survive this call:
    eng.sum(matlab_data)

Pympler has another great feature for tracking allocations: the SummaryTracker records and displays all allocations that happen between two calls to print_diff(). This is very useful to see how much memory is held on to during a call to leak:

from pympler import tracker
tr = tracker.SummaryTracker()
tr.print_diff()
leak()
tr.print_diff()
                     types |   # objects |   total size
========================== | =========== | ============
       <class 'array.array |           1 |      8.00 MB
...

And there you have it. Note that this leak is neither the Numpy array test_data nor the Matlab array matlab_data. Both of those are garbage-collected correctly. But the Matlab Engine for Python leaks any data you pass to a Matlab function.
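As a quick cross-check, the same muppy machinery from above can count the surviving array.array instances around a call; if the engine is the culprit, each call to leak should leave exactly one more behind (a sketch):

import array
from pympler import muppy

def count_arrays():
    # count all array.array instances muppy can currently reach:
    return sum(isinstance(obj, array.array) for obj in muppy.get_objects())

before = count_arrays()
leak()
print(count_arrays() - before)  # prints 1: one more leaked 8 MB array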

This data is not referenced from anywhere within Python, and is counted as leaked by objgraph. In other words, the C code inside the Matlab Engine for Python copies all passed data into its internal memory, but never frees it. Not even if you quit the Matlab Engine, or del all Python references to it. Your only option is to restart Python.
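Short of restarting Python, the only workaround I can think of is to quarantine the engine in a disposable worker process, so that the operating system reclaims the leaked memory when the worker exits. A minimal sketch using the standard multiprocessing module (matlab_job and run_in_subprocess are hypothetical helpers; scalar results cross the process boundary as plain floats, but matrix results would have to be converted to lists first):

import multiprocessing

def matlab_job(func_name, data, queue):
    # runs in a fresh process; all leaked engine memory dies with it:
    import matlab
    import matlab.engine
    eng = matlab.engine.start_matlab()
    queue.put(getattr(eng, func_name)(matlab.double(data)))
    eng.quit()

def run_in_subprocess(func_name, data):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=matlab_job,
                                   args=(func_name, data, queue))
    proc.start()
    result = queue.get()
    proc.join()
    return result

print(run_in_subprocess('sum', [1.0, 2.0, 3.0]))  # prints 6.0

Starting a fresh Matlab instance for every job is slow, of course, so in practice you would batch many calls into each worker before letting it die.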

Postscriptum

I have since filed a bug report with Mathworks and received a patch that fixes the problem. According to Mathworks, the problem only occurs on Linux.

Tags: matlab