A few days ago, a new version of Darktable was released. Darktable is an Open Source RAW image processor, and one of my favorite pieces of software of all time. It may be THE most powerful RAW processor on the market, far exceeding its proprietary brothers. But in the social media discussions following the release, it got dismissed for being unintuitive.
Which is silly, ironic, sad; why should Intuition be a measure of quality for professional software? Because in contrast to a transient app on your phone, RAW processing is a complex task that requires months of study to do well. Like most nontrivial things, it is fundamentally not an intuitive process. In fact, most of my favorite tools are like this: Efficient, but not Intuitive; like Darktable, Scientific Python, the Unix Command Line, Emacs.
These tools empower me to do my job quickly and directly. They provide sharp tools for intelligent users, which can be composed easily and infinitely. Their power derives from clean metaphors and clear semantics, instead of "intuitive", "smart", magics. They require learning to use well, but reward that effort with efficiency. They are professional, and treat their users as adults.
In my career as a software developer, I know these ideas as traits of good APIs: Orthogonal, minimal, composable. APIs are great not when they are as expansive and convenient as possible, but when they are most constrained and recombinable. In other words, "don't optimize for stupid". That's what I understand as good design in both programming, and in professional UIs.
Which is quite the opposite of Intuitive. Quite the opposite of "easy to grasp without reading the manual". That's a description of a video game, not a professional tool. As professionals, we should seek and build empowering, efficient tools for adults, not intuitive games maskerading as professional software.
We went on a once-in-a-lifetime vacation this year. Six weeks in the USA, roving New England in an RV, and going on a helicopter tour around New York City. I have spent months thinking about the perfect camera for going on this trip. I went on several vacations with various permutations of cameras and lenses as a test-run. I data-mined my existing photographs to find the perfect setup through science. And I scoured the internet for untold hours to get it right.
Here's what I took with me:
That's two cameras! Most people would take
their smartphone one camera with multiple lenses. I take multiple cameras with one lens each. Why? Because they're both awesome, at different things:
The Fujifilm X-E3 with an XF 18-135 f/3.5-5.6 lens
The X-E3 is a mirrorless camera, which means it's small. It is a very recent camera with a modern sensor. It's a rangefinder-style camera with the viewfinder in the top left. So it doesn't obscure my whole face when I lift it to my eye.
Compared to a DSLR, this is a small and light camera1. But paired with the lens, it is neither. That lens, though: a massive focal range from 18 mm to 135 mm is suitable for wide landscapes, as well as tight tele shots of wildlife. Cramming it all in one lens must come with a loss in sharpness and aperture, but the 18-135 is a best of its breed, and still plenty sharp enough for me. And it comes with an image stabilization so sophisticated that I can hand-hold half-second exposures. Which is magic.
This is a great camera for deliberately framing a shot; it feels great to hold, with a neat and efficient control layout for quickly playing with apertures and shutter speeds and focal lengths with a minimum amount of fuss. It is a dependable tool, an amazing piece of engineering, and an artistic means of expression. It is truly a great camera.
If it has an Achille's heel, it's that it takes a while to get ready. Not because it is slow to start up, but because there's a delay before the viewfinder activates when raised to the face. Which just means you will stare into the void for half a second, while seeing the critical moment pass in front of you. Half a second of missing the shot.
And the white balance can struggle a bit with forests and lakes. Forests are green, and to compensate, the X-E3 turns everything else magenta. Easy to fix in post, but annoying. These are minor complaints, and most of the time the X-E3 is as perfect a camera as I've ever seen one.
Here are a few pictures I took with the X-E3 on my last vacation2:
Image quality is simply excellent. Not once has the X-E3 let me down. Autofocus is accurate, and instant. The lens is amazingly sharp all through its focal range. Dynamic range is unreal, at times. As you can see in the pictures, I've shot everything from landscapes to people, to wildlife, long exposures, and even some astro. All with one lens, on one camera.
The Ricoh GR
So, if the X-E3 is my "professional" camera, the Ricoh GR is my fun camera. In 2019, the GR old technology. It was released in 2013, it has a big but aging 16 megapixel APS-C sensor, and a fixed 18 mm f/2.8 lens with no zoom. It shouldn't be able to compete with the Fuji. But there is one thing the GR does better than any other camera: Speed.
The GR is in my pants pocket. It wouldn't be my pants without a GR in it. I have it with me everywhere, all the time. And at the right moment, I will pull it out, and take a picture, before you even notice it's there. That's the GR's mission: The fastest possible time from pants to picture!
Seriously, the GR goes from off to taking the first shot before even its screen turns on3. I can whip out my GR and take a picture long before you have unlocked your phone, never mind started the camera app, selected a focus point, or aimed it at your prey. In the car, I had the GR in the center console, and could take a picture of the roadside before it whizzed past. And without taking my eyes off the road, naturally.
Since the GR has a fixed and wide focal length, I don't have to look at its screen to see what I'm taking a picture of. I can do it by feel. Because all the controls are on the right, I can adjust everything with my right hand only, without taking my hands off the wheel, or stroller, or wife. And then there's snap focus, where a full-press of the shutter will take a shot at a pre-defined distance, without waiting for autofocus. I wish every camera had this feature!
However, all of this took some getting used to at first. I missed focus a lot in the beginning. But once I figured it out, a new world opened up. A world of whimsy and spontaneity that is simply not possible with a bigger or slower camera. A good half of my favorite pictures were only possible thanks to the GR.
Again, there are downsides. It's an old sensor. Dynamic range is limited. Autofocus could be faster. There's a noticeable shutter delay. There is no viewfinder or image stabilization. The white balance is not great. It can struggle with a full memory card4.
But none of that matters. I love its colors. I love its sharpness. It has taken quite the beating on this vacation, but I wouldn't have it any other way. The GR is magnificent. There is nothing else quite like it5. And so many memories would have been lost if not for the GR:
Many of these shots were taken in the spur of the moment. In situations where I didn't take the big camera with me. Where a big camera would have been inappropriate. But the GR is so tiny, so non-threatening, it can become part my daily life. And capture the fleeting moments that make life worth living. That is a photographic super power if there ever was one.
The body weighs 337g, which is a feather. The lens weighs 490g.
No personal pictures, though. I don't share those publicly.
The magic is "snap focus". If you depress the shutter fully, without waiting for autofocus, the GR immediately takes the shot at a pre-set distance. Works in any mode, and even when just powering on: Press the power button, aim, press the shutter, and before even the screen turns on you will have taken a picture. Lightning fast.
As the card fills up, getting into playback mode takes longer and longer. Minutes, for a full 64 Gb card.
Forget about the Fuji XF10. I tried it. It's a newer sensor, but so clumsy, so slow, so awkward to use. It's no comparison.
Our first stop in speech analysis is usually the short-time Fourier transform, or STFT1:
As we can see, this speech signal has a strong fundamental frequency track around 120 Hz, and harmonics at integer multiples of the fundamental frequency. We perceive the frequency of the fundamental as the speech's pitch. The magnitude of the harmonics varies over time, which we perceive as the sounds of different vowels.
All of this is clearly visible in this STFT. To calculate the STFT, we chop up the audio signal into short, overlapping blocks, apply a window function 2 to each of these blocks, and Fourier-transform each windowed block into the frequency domain:
We zoomed in a bit to make this visible. At this zoom level, an old friend becomes visible in the waveform: The waveform of our harmonic spectra is clearly periodic. We also see that each of our blocks contain multiple periods of the speech signal. So what happens if we make our blocks shorter?
As we make our blocks shorter, such that they only contain a single period of the signal, a funny thing happens: the harmonic structure vanishes. Why? Because the harmonic structure is caused by the interference between neighboring clicks; As soon as each block is restricted to a single period, there is no interference any longer, and no harmonic structure.
But instead of a harmonic structure, we see the periodicity of the waveform emerge in the STFT! To look at this phenomenon in a bit more detail, we will have to leave the classical STFT domain, and look at a close sibling, the Constant Q Transform, or CQT. The CQT is simply an STFT with different block sizes for different frequencies:
The periodic structure of the signal is now perfectly apparent. Each period of the signal gives rise to a skewed click-like spectrum that repeats at the fundamental frequency. Interestingly, however, the fundamental itself shows no periodicity. To see why this is so, and to finally get a grasp of how the short-time Fourier transforms interprets signals, we have to turn to our last graph, a gamma-tone filterbank:
The fundamental frequency is now clearly a visible as a single sinusoid, rotating its merry way at 120 Hz. Each harmonic is its own sinusoid at an integer multiple of the fundamental frequency, but changing in amplitude in lockstep with the findamental. Thus in this graph, we see both the harmonicity and the periodicity at the same time, and can finally appreciate the complex interplay of magnitudes and phases that give rise to the beautiful and simple sound of the human voice3.
A periodic signal has a harmonic spectrum. In the extreme case, a click train signal has a comb spectrum:
But why? After all, a solitary click has a uniform spectrum. So why should the sum of multiple clicks have a non-uniform spectrum?
The answer exposes a lot of detail of how spectra work, and gives us a glimpse into the inner workings of spectral phases. Let's start with that click: In that spectrum, we see that we can decompose a click into a sum of sine waves:
As we have seen in the spectrum of the click earlier, it is composed of all frequencies (uniform spectrum). But we see here that the individual sine waves are delayed just so, that they all add constructively at exactly one time, and form the click. At all other times, they cancel each other out. This per-sine delay is called the phase of that frequency.
But what happens if we have more than one click? How does that change things?
In a click train, odd frequencies in the Fourier series of the individual clicks cancel each other out (red/blue), and only harmonic frequencies (brown) remain. And this is exactly what our first spectrum showed: A periodic click train results in a harmonic spectrum.
Even though each click has a uniform spectrum, adding multiple clicks together cancels out all non-harmonic parts, and only a harmonic comb spectrum remains.
Now that I am officially a failed scientist, I might as well talk about my research in public. I spent the last few years analyzing speech recordings. Particularly, voiced speech, where vibrations in the vocal folds excite resonances in the vocal tract, and a tonal sound leaves our mouths and noses.
As humans, we are particularly tuned to recognizing these kinds of sounds. Even in loud background noise, even with dozens of people talking at the same time, we can clearly identify the sound of a human voice (even if we might not be able to understand the words).
Looking at these sounds from a physical point of view, we can see that it is made up of a fundamental frequency at the voice's pitch, and harmonics at integer multiples of the fundamental. And even though the sound is clearly composed of multiple harmonics, we perceive it as a single sound with a single pitch. Even more perplexing, we attribute all of these harmonics to a single voice, even if they criss-cross with tonal sounds from different sources.
Yet, speech recognition systems regularly struggle with such tasks, unless we feed them unholy amounts of data and processing power. In other words, there has to be more to speech than the simple figure above indicates.
One area is definitely time resolution. Obviously, when the vocal folds open to admit a puff of air into the vocal tract, phases align, and loudness is higher than when the vocal folds have closed again and phases go out of sync. This happens several hundreds of times per second, at the frequency of the fundamental. Yet, this phase coherence is invisible in most of our visualizations, such as the spectrogram above, or the MFCCs usually used in speech recognition, as they are too coarse for such short-time detail.
An even more interesting detail emerges from fMRI scans of people who are speaking, and people who are listening to speech: their activation patterns are strikingly similar. As in, motor groups are activating when listening, just as if actual speech muscles were moved. To me, this indicates that when we listen to speech, we simulate speaking. And I find it highly likely that we understand speech mostly in terms of the movements we would have to make to imitate it. In other words, we do not internalize speech as an audio signal, but as muscle movements.
This matches another observation from a different area: When learning a foreign language, we can not hear what we can not produce. If you didn't learn how to speak an Ö and Ü (two German umlauts) as a child, you will have a hard time hearing the difference as an adult. Yet they sound completely distinct to me. In a production model, this makes a lot of sense, as we wouldn't know how to simulate a sound we can not produce.
Bringing this back to the science of signal processing, I believe that most speech analysis algorithms are currently lacking a production model of speech. Speech can not be fully understood as an audio signal. It needs to be understood in terms of the variables and limitations of a human vocal tract. I believe that if we integrated such a physiological production model into our machine learning models, we wouldn't need to feed them such vast amounts of data and electricity, and might even get by without them.