Basti's Scratchpad on the Internet
03 Oct 2015

Changing File Creation Dates in OSX

On my last vacation, I have taken a bunch of pictures, and a bunch of video. The problem is, I hadn't used the video camera in a long time, and it believed that all it's videos were taken on the first of January 2012. So in order for the pictures to show up correctly in my picture library, I wanted to correct that.

For images, this is relatively easy: Most picture libraries support some kind of bulk date changes, and there are a bunch of command line utilities that can do it, too. But none of these tools work for video (exiftool claims be able to do that, but I couldn't get it to work).

So instead, I went about to change the file creation date of the actual video files. And it turns out, this is surprisingly hard! The thing is, most Unix systems (a Mac is technically a Unix system) don't even know the concept of a file creation date. Thus, most Unix utilities, including most programming languages, don't know how to deal with that, either.

If you have XCode installed, this will come with SetFile, a command line utility that can change file creation dates. Note that SetFile can change either the file creation date, or the file modification date, but not both at the same time, as any normal Unix utility would. Also note that SetFile expects dates in American notation, which is about as nonsensical as date formats come.

Anyway, here's a small Python script that changes the file creation date (but not the time) of a bunch of video files:

import os.path
import os
import datetime
# I want to change the dates on the files GOPR0246.MP4-GOPR0264.MP4
for index in range(426, 465):
    filename = 'GOPR0{}.MP4'.format(index)
    # extract old date:
    date = datetime.datetime.fromtimestamp(os.path.getctime(filename))
    # create a new date with the same time, but on 2015-08-22
    new_date = datetime.datetime(2015,  8, 22, date.hour, date.minute, date.second)
    # set the file creation date with the "-d" switch, which presumably stands for "dodification"
    os.system('SetFile -d "{}" {}'.format(new_date.strftime('%m/%d/%Y %H:%M:%S'), filename))
    # set the file modification date with the "-m" switch
    os.system('SetFile -m "{}" {}'.format(new_date.strftime('%m/%d/%Y %H:%M:%S'), filename))
29 Sep 2015

Numpy Broadcasting Rules

They say that all arithmetic operations in Numpy behave like their element-wise cousins in Matlab. This is wrong, and seriously tripped me up last week.

In particular, this is what happens when you multiply an array with a matrix1 in Numpy:

     [[  1],           [[1, 2, 3],       [[ 1,    2,   3],
      [ 10],       *    [4, 5, 6],   =    [ 40,  50,  60],
      [100]]            [7, 8, 9]]        [700, 800, 900]]

 [  1,  10, 100]       [[1, 2, 3],       [[  1,  20, 300],
        OR         *    [4, 5, 6],   =    [  4,  50, 600],
[[  1,  10, 100]]       [7, 8, 9]]        [  7,  80, 900]]

They behave as if each row was evaluated separately, and singular dimensions are repeated where necessary. It helps to think about them as row-wise, instead of element-wise. This is particularly important in the second example, where the whole 1d-array is multiplied with every row of the 2d-array.

Note that this is not equivalent to multiplying every element as in [a[n]*b[n] for n in range(len(a))]. I guess that's why this is called broadcasting, and not element-wise.



"matrix" here refers to a 2-d numpy.array. There is also a numpy.matrix, where multiplication is matrix multiplication, but this is not what I'm talking about.

28 Sep 2015

Python Numeric Performance

Recently, I was working on a dynamic programming algorithm that involves a lot of number crunching in nested loops. The algorithm looks like this:

def y_change_probability_python(oct_per_sec):
    """ ... """
    b = 1.781
    mu = -0.301
    return 1/(2*b)*math.exp(-abs(oct_per_sec-mu)/b)

def y_idx_range_python(y_idx, max_y_factor, height):
    """ ... """
    y = (y_idx/height)*(max_y-min_y)+min_y
    y_lo = max(y/max_y_factor, min_y)
    y_hi = min(y*max_y_factor, max_y)
    y_lo_idx = int((y_lo-min_y)/(max_y-min_y)*height)
    y_hi_idx = int((y_hi-min_y)/(max_y-min_y)*height)
    return y_lo_idx, y_hi_idx

def find_tracks_python(cost_matrix, delta_x):
    """ ... """
    tracks = np.zeros(correlogram.shape, dtype=np.int64)
    cum_cost = np.zeros(correlogram.shape)

    max_y_factor = (2**5)**(delta_x)
    height = correlogram.shape[1]

    probabilities = np.empty((height, height))
    for y_idx in range(height):
        y = (y_idx/height)*(max_y-min_y)+min_y
        for y_pre_idx in range(*y_idx_range_numba(y_idx, max_y_factor, height)):
            y_pre = (y_pre_idx/height)*(max_y-min_y)+min_y
            doubles_per_x = math.log2((y/y_pre)**(1/delta_x))
            probabilities[y_idx, y_pre_idx] = y_change_probability_numba(doubles_per_x)

    for x_idx, cost_column in enumerate(cost_matrix):
        if x_idx == 0:
            cum_cost[x_idx] = cost_column
        for y_idx, cost in enumerate(cost_column):
            for y_pre_idx in range(*y_idx_range_numba(y_idx, max_y_factor, height)):
                weighted_cum_cost = cum_cost[x_idx-1, y_pre_idx] + cost*probabilities[y_idx, y_pre_idx]
                if weighted_cum_cost > cum_cost[x_idx, y_idx]:
                    cum_cost[x_idx, y_idx] = weighted_cum_cost
                    tracks[x_idx, y_idx] = y_pre_idx
            cum_cost[x_idx, y_idx] = cum_cost[x_idx-1, tracks[x_idx, y_idx]] + cost

    return tracks, cum_cost

I'm not going into the details of what this algorithm does, but note that it iterates over every column and row of the matrix cost_matrix, and then iterates over another range previous_y_range for each of the values in cost_matrix. On the way, it does a lot of basic arithmetic and some algebra.

The problem is, this is very slow. For a \(90 \times 200\) cost_matrix, this takes about 260 ms. Lots of loops? Lots of simple mathematics? Slow? That sounds like a perfect match for Numpy!

If you can express your code in terms of linear algebra, Numpy will execute them in highly-optimized C code. The problem is, translating loops into linear algebra is not always easy. In this case, it took some effort:

def y_change_probability_numpy(doubles_per_x):
    """ ... """
    b = 1.781
    mu = -0.301
    return 1/(2*b)*np.exp(-np.abs(doubles_per_x-mu)/b)

def find_frequency_tracks_numpy(cost_matrix, delta_x):
    """ ... """
    tracks = np.zeros(cost_matrix.shape,
    cum_cost = np.zeros(cost_matrix.shape)

    max_y_factor = (2**5)**(delta_t) # allow at most 5 octaves per second (3 sigma)
    height = cost_matrix.shape[1]

    # pre-allocate probabilities matrix as minus infinity. This matrix
    # will be sparsely filled with positive probability values, and
    # empty values will have minus infinite probability.
    probabilities = -np.ones((height, height))*np.inf
    for y_idx in range(probabilities.shape[0]):
        y = (y_idx/height)*(max_y-min_y)+min_y
        y_pre_idx = np.arange(int((max(y/max_y_factor, min_y)-min_y)/(max_y-min_y)*height),
                              int((min(y*max_y_factor, max_y)-min_y)/(max_y-min_y)*height))
        y_pre = (y_pre_idx/height)*(max_y-min_y)+min_y
        doubles_per_x = np.log2((y/y_pre)**(1/delta_x))
        probabilities[y_idx, y_pre_idx] = y_change_probability(doubles_per_x)

    cum_cost[0] = cost_matrix[0]
    for x_idx in range(1, len(cost_matrix)):
        cost_column = cost_matrix[x_idx:x_idx+1] # extract cost_column as 2d-vector!
        weighted_cum_cost = cum_cost[x_idx-1] + cost_column.T*probabilities
        tracks[x_idx] = np.argmax(weighted_cum_cost, axis=1)
        cum_cost[x_idx] = cum_cost[x_idx-1, tracks[x_idx]] + cost_column

    return tracks, cum_corrs

This code does not look much like the original, but calculates exactly the same thing. This takes about 15 ms for a \(90 \times 200\) cost_matrix, which is about 17 times faster than the original code! Yay Numpy! And furthermore, this code is arguably more readable than the original, since it is written at a higher level of abstraction.

Another avenue for performance optimization is Numba. Numba applies dark and powerful magic to compile humble Python functions into blazingly fast machine code. It is proper magic, if you ask me. Simply add an innocuous little decorator to your functions, and let Numba do it's thing. If all goes well, your code will work just as before, except with unheard-of performance:

@jit(numba.float64(numba.float64), nopython=True)
def y_change_probability_numba(doubles_per_x):

@jit((numba.int64, numba.float64, numba.int64), nopython=True)
def y_idx_range_numba(y_idx, max_y_factor, height):

@jit((numba.float64[:,:], numba.float64), nopython=True)
def find_tracks_numba(cost_matrix, delta_t):

However, Numba is no silver bullet, and does not support all of Python yet. In the present case, it is missing support for enumerate for Numpy matrices. Thus, I had to rewrite the first two loops like this:

# python version
for x_idx, cost_column in enumerate(cost_matrix):

# numba version
for x_idx in range(len(cost_matrix)):
    cost_column = cost_matrix[x_idx]

Another area that proved problematic is N-D slice writing. Instead of using expressions like m1[x,y:y+3] = m2, you have to write for idx in range(3): m1[x,y+idx] = m2[idx]. Not a difficult transformation, but it basically forced me to unroll all the nice vectorized code of the Numpy version back to their original pure-Python form. That said, Numba is getting better and better, and many constructs that used to be uncompilable (yield) are not a problem any more.

Anyway, with that done, the above code went down from 260 ms to 2.2 ms. This is a 120-fold increase in performance, and still seven times faster than Numpy, with minimal code changes. This is proper magic!

So why wouldn't you just always use Numba? After all, when it comes down to raw performance, Numba is the clear winner. The big difference between performance optimization using Numpy and Numba is that properly vectorizing your code for Numpy often reveals simplifications and abstractions that make it easier to reason about your code. Numpy forces you to think in terms of vectors, matrices, and linear algebra, and this often makes your code more beautiful. Numba on the other hand often requires you to make your code less beautiful to conform to it's subset of compilable Python.

22 Apr 2015

Matlab and Audio Files

So I wanted to work with audio files in Matlab. In the past, Matlab could only do this with auread and wavread, which can read *.au and *.wav files. With 2012b, Matlab introduced audioread, which claims to support *.wav, *.ogg, *.flac, *.au, *.mp3, and *.mp4, and simultaneously deprecated auread and wavread.

Of these file formats, only *.au is capable of storing more than 4 Gb of audio data. But the documentation is actually wrong: audioread can actually read more data formats than documented: it reads *.w64, *.rf64, and *.caf no problem. And these can store more than 4 Gb as well.

It's just that, while audioread supports all of these nice file formats, audiowrite is more limited, and only supports *.wav, *.ogg, *.flac, and *.mp4. And it does not support any undocumented formats, either. So it seems that there is no way of writing files larger than 4 Gb. But for the time being, auwrite is still available, even though deprecated. I tried it, though, and it didn't finish writing 4.8 Gb in half an hour.

In other words, Matlab is incapable of writing audio files larger than 4 Gb. It just can't do it.

15 Apr 2015

Unicode and Matlab on the command line

As per the latest Stackoverflow Developer Survey, Matlab is one of the most dreaded tools out there. I run into Matlab-related trouble daily. In all honesty, I have never seen a programming language as user-hostile and as badly designed as this.

So here is today's problem: When run from the command line, Matlab does not render unicode characters (on OSX).

I say "(on OSX)", because on Windows, it does not print a damn thing. Nope, no disp output for Windows users.

More analysis: It's not that Matlab does not render unicode characters at all when run from the command line. Instead, it renders them as 0x1a aka SUB aka substitute character. In other words, it tries to render unicode as ASCII (which doesn't work), and then replaces all non-ASCII characters with SUB. This is actually reasonable if Matlab were running on a machine that can't handle unicode. This is not a correct assessment of post-90s Macs, though.

To see why Matlab would do such a dastardly deed, you can use feature('locale') to get information about the encoding Matlab uses. On Windows and OS X, this defaults to either ISO-8859-1 (when your locale is pure de_DE or en_US) or US-ASCII, if it is something impure. In my case, German dates but English text. Because US-ASCII is obviously the most all-encompassing choice for such mixed-languages environments.

But luckily, there is help. Matlab has a widely documented (not) and easily discoverable (not) configuration option to change this: To change Matlab's encoding settings, edit %MATLABROOT%/bin/lcdata.xml, and look for the entry for your locale. For me, this is one of

<locale name="de_DE" encoding="ISO-8859-1" xpg_name="de_DE.ISO8859-1"> ...
<locale name="en_US" encoding="ISO-8859-1" xpg_name="en_US.ISO8859-1"> ...

In order to make Matlab's encoding default to UTF-8, change the entry for your locale to

<locale name="de_DE" encoding="UTF-8" xpg_name="de_DE.UTF-8"> ...
<locale name="en_US" encoding="UTF-8" xpg_name="en_US.UTF-8"> ...

With that, Matlab will print UTF-8 to the terminal.

You still can't type unicode characters to the command prompt, of course. But who would want that anyway, I dare ask. Of course, what with Matlab being basically free, and frequently updated, we can forgive such foibles easily…

Older posts