Our first stop in speech analysis is usually the short-time Fourier transform, or STFT:
As we can see, this speech signal has a strong fundamental frequency track around 120 Hz, and harmonics at integer multiples of the fundamental frequency. We perceive the frequency of the fundamental as the speech's pitch. The magnitude of the harmonics varies over time, which we perceive as the sounds of different vowels.
All of this is clearly visible in this STFT. To calculate the STFT, we chop up the audio signal into short, overlapping blocks, apply a window function to each of these blocks, and Fourier-transform each windowed block into the frequency domain:
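If you want to play along at home, the whole procedure fits in a few lines of numpy. The block size, hop, and Hann window below are arbitrary illustrative choices of mine, not the exact parameters behind the plots:

```python
import numpy as np

def stft(x, block_size=1024, hop=256, fs=44100):
    """Chop x into overlapping blocks, window each one, and FFT it."""
    window = np.hanning(block_size)
    n_blocks = (len(x) - block_size) // hop + 1
    blocks = np.stack([x[i * hop : i * hop + block_size] * window
                       for i in range(n_blocks)])
    spectra = np.fft.rfft(blocks, axis=-1)     # one spectrum per block
    freqs = np.fft.rfftfreq(block_size, 1 / fs)  # frequency axis in Hz
    return freqs, spectra
```

Each row of `spectra` is one column of the spectrogram; plotting `abs(spectra)` over time reproduces the kind of picture shown above.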
We zoomed in a bit to make this legible. At this zoom level, an old friend reappears in the waveform: it is clearly periodic. We also see that each of our blocks contains multiple periods of the speech signal. So what happens if we make our blocks shorter?
As we make our blocks shorter, such that they only contain a single period of the signal, a funny thing happens: the harmonic structure vanishes. Why? Because the harmonic structure is caused by the interference between neighboring clicks; as soon as each block is restricted to a single period, there is no interference any longer, and no harmonic structure.
But instead of a harmonic structure, we see the periodicity of the waveform emerge in the STFT! To look at this phenomenon in a bit more detail, we will have to leave the classical STFT domain, and look at a close sibling, the Constant Q Transform, or CQT. The CQT is simply an STFT with different block sizes for different frequencies:
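In code, the difference to the STFT is essentially one line: instead of a fixed block size, each frequency bin analyzes a block that spans a fixed number of its own periods. Here is a toy sketch of a single CQT column; the Q factor, frequency range, and bin spacing are arbitrary choices of mine, not a faithful reimplementation:

```python
import numpy as np

def cqt_frame(x, center, fs=8000, fmin=60.0, bins_per_octave=24,
              n_bins=96, Q=17):
    """One CQT column around sample index `center`: each frequency bin
    analyzes a block that is Q periods of its center frequency long."""
    out = np.empty(n_bins, dtype=complex)
    for k in range(n_bins):
        f = fmin * 2 ** (k / bins_per_octave)  # log-spaced frequency axis
        n = int(Q * fs / f)                    # block length: Q periods of f
        block = x[center - n // 2 : center - n // 2 + n]
        t = np.arange(n) / fs
        kernel = np.hanning(n) * np.exp(-2j * np.pi * f * t)
        out[k] = block @ kernel / n            # normalized so bins are comparable
    return out
```

Note that `center` must sit far enough from the signal edges for the longest (lowest-frequency) block to fit; a production CQT would zero-pad instead.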
The periodic structure of the signal is now perfectly apparent. Each period of the signal gives rise to a skewed, click-like spectrum that repeats at the fundamental frequency. Interestingly, however, the fundamental itself shows no periodicity. To see why this is so, and to finally get a grasp of how the short-time Fourier transform interprets signals, we have to turn to our last graph, a gammatone filterbank:
The fundamental frequency is now clearly visible as a single sinusoid, rotating its merry way at 120 Hz. Each harmonic is its own sinusoid at an integer multiple of the fundamental frequency, but changes in amplitude in lockstep with the fundamental. Thus, in this graph, we see both the harmonicity and the periodicity at the same time, and can finally appreciate the complex interplay of magnitudes and phases that gives rise to the beautiful and simple sound of the human voice.
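A gammatone filterbank can also be sketched in a few lines: each channel is a tone at its center frequency, enveloped by a gamma-shaped window. The sample rate, filter order, and the ERB bandwidth formula below are common textbook choices, not necessarily the ones behind the plot:

```python
import numpy as np

def gammatone_ir(f, fs=16000, duration=0.05, order=4):
    """Impulse response of one gammatone filter centered at f Hz."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 + f / 9.265          # equivalent rectangular bandwidth (Glasberg/Moore)
    b = 1.019 * erb                 # bandwidth parameter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)
    return g / np.max(np.abs(g))

def filterbank(x, centers, fs=16000):
    """Convolve x with one gammatone filter per center frequency."""
    return np.stack([np.convolve(x, gammatone_ir(f, fs), mode='same')
                     for f in centers])
```

Feeding a voiced-speech signal through `filterbank` with centers at the fundamental and its harmonics yields exactly the per-channel sinusoids described above.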
A periodic signal has a harmonic spectrum. In the extreme case, a click train signal has a comb spectrum:
But why? After all, a solitary click has a uniform spectrum. So why should the sum of multiple clicks have a non-uniform spectrum?
The answer exposes a lot of detail about how spectra work, and gives us a glimpse into the inner workings of spectral phases. Let's start with the click itself. Its spectrum shows that we can decompose a click into a sum of sine waves:
As we saw in the spectrum of the click earlier, it is composed of all frequencies (a uniform spectrum). But here we see that the individual sine waves are delayed just so that they all add constructively at exactly one time, and form the click. At all other times, they cancel each other out. This per-sine delay is called the phase of that frequency.
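This is easy to verify numerically: the spectrum of a single click has magnitude one at every frequency, and summing one cosine per frequency bin, each shifted by its phase, reconstructs the click exactly. The signal length and click position below are arbitrary:

```python
import numpy as np

n, delay = 64, 20
click = np.zeros(n)
click[delay] = 1.0

spectrum = np.fft.rfft(click)
mag = np.abs(spectrum)      # uniform: 1.0 at every frequency
phase = np.angle(spectrum)  # the per-frequency delay, linear in frequency

# Sum one cosine per bin; DC and Nyquist count once, all other bins twice.
t = np.arange(n)
rebuilt = sum((1 if k in (0, n // 2) else 2)
              * mag[k] * np.cos(2 * np.pi * k * t / n + phase[k])
              for k in range(n // 2 + 1)) / n
```

At `t == delay` all cosines hit their peak simultaneously; everywhere else they sum to zero.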
But what happens if we have more than one click? How does that change things?
In a click train, odd frequencies in the Fourier series of the individual clicks cancel each other out (red/blue), and only harmonic frequencies (brown) remain. And this is exactly what our first spectrum showed: A periodic click train results in a harmonic spectrum.
Even though each click has a uniform spectrum, adding multiple clicks together cancels out all non-harmonic parts, and only a harmonic comb spectrum remains.
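The cancellation is exact, and easy to check numerically: the spectrum of a click train is zero everywhere except at integer multiples of the fundamental. All lengths below are arbitrary:

```python
import numpy as np

n, period = 1024, 64        # signal length and click spacing (arbitrary)
train = np.zeros(n)
train[::period] = 1.0       # one click every `period` samples

spectrum = np.abs(np.fft.rfft(train))
harmonics = spectrum[::n // period]  # bins at multiples of the fundamental
```

Every harmonic bin comes out equal to the number of clicks (here 16), while every other bin is zero to within floating-point error: a perfect comb.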
Now that I am officially a failed scientist, I might as well talk about my research in public. I spent the last few years analyzing speech recordings; in particular, voiced speech, where vibrations of the vocal folds excite resonances in the vocal tract, and a tonal sound leaves our mouths and noses.
As humans, we are particularly tuned to recognizing these kinds of sounds. Even in loud background noise, even with dozens of people talking at the same time, we can clearly identify the sound of a human voice (even if we might not be able to understand the words).
Looking at these sounds from a physical point of view, we can see that they are made up of a fundamental frequency at the voice's pitch, and harmonics at integer multiples of the fundamental. And even though the sound is clearly composed of multiple harmonics, we perceive it as a single sound with a single pitch. Even more perplexing, we attribute all of these harmonics to a single voice, even if they criss-cross with tonal sounds from different sources.
Yet, speech recognition systems regularly struggle with such tasks, unless we feed them unholy amounts of data and processing power. In other words, there has to be more to speech than the simple figure above indicates.
One area is definitely time resolution. Obviously, when the vocal folds open to admit a puff of air into the vocal tract, phases align, and loudness is higher than when the vocal folds have closed again and the phases drift out of sync. This happens several hundred times per second, at the frequency of the fundamental. Yet this phase coherence is invisible in most of our visualizations, such as the spectrogram above, or the MFCCs usually used in speech recognition, as they are too coarse for such short-time detail.
An even more interesting detail emerges from fMRI scans of people who are speaking and people who are listening to speech: their activation patterns are strikingly similar. As in, motor regions activate when listening, just as if actual speech muscles were being moved. To me, this indicates that when we listen to speech, we simulate speaking. And I find it highly likely that we understand speech mostly in terms of the movements we would have to make to imitate it. In other words, we do not internalize speech as an audio signal, but as muscle movements.
This matches another observation from a different area: when learning a foreign language, we cannot hear what we cannot produce. If you didn't learn how to speak an Ö or an Ü (two German umlauts) as a child, you will have a hard time hearing the difference as an adult. Yet they sound completely distinct to me. In a production model, this makes a lot of sense, as we wouldn't know how to simulate a sound we cannot produce.
Bringing this back to the science of signal processing, I believe that most speech analysis algorithms currently lack a production model of speech. Speech cannot be fully understood as an audio signal; it needs to be understood in terms of the variables and limitations of a human vocal tract. I believe that if we integrated such a physiological production model into our machine learning models, we wouldn't need to feed them such vast amounts of data and electricity, and might even get by without them.
As part of my PhD, I am supposed to publish three papers. So far, I have been unable to do so. But this is not about me, I will survive regardless. This is about the systems behind our papers' rejections. Because they are… bad. Political. Un-scientific.
Our first manuscript was submitted for publication, and got a middling review. If we wanted our work to be published, we were to expand our introduction to mention the reviewers' favorite publications, and broaden our comparison to include their work. This is considered normal. In the second round of reviews, we then got rejected, because our introduction was now too long, and our comparison too broad.
The reviews additionally claimed that "novelty cannot be claimed for something that is not validated and compared to state of the art" and that "[our work lacks] formal statistical evaluation of the estimation performance". Which is certainly true, but is also true of every other work published on the same topic in the last five years (I checked). We showed this evidence to the reviewers, but it was not even deemed worthy of a comment.
In hindsight, however, we realized that we had included at least one reviewer's own algorithm in our comparison, and found it lacking. Their work had only ever been tested, publicly, on a single example recording, where it worked well. Our comparison did the same with twenty thousand recordings, which highlighted some issues. So our paper was rejected. Of course, we can't be sure that this was the actual reason, as the reviewers' names are not disclosed to the reviewees (but certainly vice versa).
Our next submission was to a different journal. This time, we had learned from our mistakes, and kept the scope of our investigation more minimal. There would be only a very small comparison, and we would be very careful not to step on anyone else's toes. The review was, again, negative.
This time, the grounds for rejection were a lack of comparison to the state of the art (not a winning move, see above), and our overly high false negative rate. Additionally, the review contained wonderful verbiage like:
The are many methods that are very similar to the presented method in the sense of being feature extraction methods operating in the STFT domain.
…which is just patently ridiculous. If being a "feature extraction method in the STFT domain" was grounds for rejection, there would be no publications in our area of research. And let's ignore for a minute that our publication was not, in fact, such a method.
Again, hindsight showed the real culprit: Our manuscript reported a high false negative rate of roughly 50%. Had we just not mentioned this, no one would have noticed. That is what everyone else is doing. More importantly however, reporting on false positive/negative rates in our evaluation called into question every other publication that hadn't. And we can't have that.
Another submission was liked because no one had done anything similar before, and was found to provide value to many researchers, but rejected because it still somehow "lacked novelty".
So, in summary, our first submission was rejected because it made one of the reviewers look bad, and the second because we wanted to report not only our method's advantages, but also its shortcomings. Worse, in following the evidence where it led, we had created new error measures that could potentially find flaws in existing publications, which could potentially make a whole lot of researchers look bad.
After five years of dealing with this, I am thoroughly disheartened. Instead of a system for disseminating knowledge, I have found the scientific publishing system a political outlet for the famous, and a lever for keeping uncomfortable knowledge suppressed. This is not the scientific world I want to live in, and apparently, it doesn't want me to live in it, either.
So I bought a new camera. Now I need new lenses. In this post, I am looking for a standard zoom lens, i.e. something that covers a bit of wide-angle, all the way through the normal range, up to a bit of telephoto. In Fuji's lineup these needs are met by
- the XC 15‑45 mm f/3.5‑5.6 OIS PZ (€ 150, 136 g, 4.4 cm)
- the XC 16‑50 mm f/3.5‑5.6 OIS II (€ 150, 195 g, 6.5 cm)
- the XF 18‑55 mm f/2.8‑4 R LM OIS (€ 250, 310 g, 7.0 cm)
- the XF 18‑135 mm f/3.5‑5.6 R LM OIS WR (€ 500, 490 g, 9.8 cm)
- the XF 16‑55 mm f/2.8 R LM WR (€ 650, 655 g, 10.6 cm)
- a bag of primes (€ inf, many g, lots of cm)
  - the XF 27 mm f/2.8 (€ 150, 78 g, 2.3 cm)
The XC 15‑45 is out, because its zoom ring is a motorized power zoom instead of the direct turn-to-zoom ring of every other zoom lens. I tried it; I couldn't stand it. The XF 16‑55 is out because it is just too expensive and big for me. A bag of primes is not what I want, but I included the XF 27, because that's what I happened to have at hand. All of the above prices are used prices as of early 2019 in Germany.
Before we get started, all of these lenses are perfectly sharp to my eyes. At least in the center-ish area of the image, every pixel shows different information, which is not something I could have said about some of the Nikon lenses I used to own. Because of that, I will not compare sharpness.
The word of mouth is that the XF 18‑55 is a stellar kit lens, the XC 16‑50 is a bit cheap, and the XF 18‑135 a bit of a compromise. If internet forums are to be believed, these differences are massive, and the XF 18‑55 is really the only acceptable non-prime lens any self-respecting Fuji fanboy can buy. But then again, that's what internet forums would say, right?
With that out of the way, let's have a look at these lenses! I'll use crops from a terribly boring shot linked here for most of my examples. Why this shot? Because a) that's what was available, and b) it contains areas that nicely showcase these lenses' qualities. All shots were taken at f/8 at 27 mm, ISO 400, and a shutter speed around 1/300 s.
I often read that micro contrast really tells lenses apart. To illustrate this point, here's an area with very little overall contrast, particularly between the fence and the orange paint, and the fence and the gray stairs:
If you look very closely, you might find the fence slightly less visible on the XC 16‑50 than on the XF 18‑135 and XF 27, and ever so slightly more visible on the 18‑55. But it should also be obvious that these differences are incredibly tiny, and not worth fussing over.
Corner Sharpness and Chromatic Aberrations
Another common point is corner sharpness, which is typically said to strongly favor primes. This time, the crops come from the bottom-right corner of the image, which contains some detail, but most importantly a bright white warning sign:
And indeed, the XC 16‑50 is noticeably blurry this time, with the other three lenses similarly sharp. The warning sign also highlights color fringes on the transitions from the bright white sign to the dark background. These chromatic aberrations are almost invisible on the XF 18‑55, and mild on the XF 18‑135 and XF 27.
Bear in mind, however, that these are 100% crops in the very furthest corners of a high-contrast image. In normal pictures, none of these issues will be noticeable unless you really zoom in on fine details at the edges of your frame. The chromatic aberrations seen here were already treated in software with the lens correction module in Darktable, but it might be possible to improve on these results with more dedicated processing.
Update 1: Even more comparison pictures
After publishing the blog post, I still wasn't satisfied: What if the results I got were only true at f/8? What if image quality got worse at a longer focal length? How does my XF 18 stack up?
To answer this, I took another set of pictures and prepared one composite of crops near the image center, and another near the lower-right corner.
To be perfectly honest, I can not see any significant differences between any of these pictures. At this point, I am starting to question the entire concept of sharpness and micro contrast for evaluating lenses. But at least I learned a lot about how to use the Gimp.
As a sanity check, I repeated the experiment with my old Nikon 18-200, and this was in fact noticeably less sharp. And slightly overexposed. And slightly off-color. That's why I switched to Fuji. But as I said, this was a sanity check, not a fair comparison, as the Nikon D7000 body is much older than my Fuji X-E3, and the lens has surely seen better days as well.
Update 1: Ergonomics and Balance
The XC 16‑50 and XF 27 control their aperture with the wheel under your right thumb. The XF 18, XF 18‑55, and XF 18‑135 have a dedicated aperture ring on the lens instead. Thus, the former two lenses can be controlled with the right hand alone, while the latter three require the left hand on the lens barrel. Zoom is always controlled on the barrel, though.
This preference for one-handed or two-handed operation is supported by the lenses' weight as well: my camera, the X-E3, weighs about 330 g. With the 195 g XC 16‑50, the weight is mostly in the camera body, which can easily be held and operated with one hand. The 310 g XF 18‑55 and the 490 g XF 18‑135 are more lens-heavy, which makes a two-handed grip necessary, and that grip is a better fit for the aperture ring on the lens.
Personally, I actually prefer the aperture on the thumb wheel over the unmarked aperture rings on the XF 18‑55 and the XF 18‑135. It just feels more natural in my hands. On the other hand, I like the marked aperture ring on the XF 18, particularly for resetting the aperture without looking through the viewfinder, or when the camera is turned off. In fact, I find the ability to operate the camera while turned off to be very useful in general. It is one of the major reasons why I like Fuji cameras.
Update 2: Image Stabilization
In order to assess the image stabilization systems built into these lenses, I took a series of pictures of a static subject at 18 mm, 27 mm, and 50 mm, at shutter speeds of 1/30 s, 1/15 s, 1/8 s, 1/4 s, and 1/2 s. I then looked at five images for every combination of lens, focal length, and shutter speed, and labeled each one sharp if there was no visible blur at all, usable if there was micro-shake only visible at 100 %, or a miss if the shot was too blurry.
The XC 16‑50 had perfect sharpness at 1/30 s, was at least ok between 1/15 s and 1/8 s, and even 1/4 s still had a few usable shots. 1/2 s or longer was unusable. There was no significant difference between the focal lengths. That last bit is really interesting, as I would have expected shorter focal lengths to be easier to hand-hold than longer ones.
The XF 18‑55 stayed perfectly sharp one stop longer until 1/15 s, but otherwise performed exactly the same as the XC 16‑50. I would guess that the small difference in stability between these two lenses is mostly due to their weight difference, but that the image stabilization system is identical.
The XF 18‑135, however, was another matter: All shots up until 1/8 s were perfectly sharp, and remained at least usable until 1/2 s! Only at 1 s of shutter speed did I see significant numbers of missed shots! Again, there was no significant difference across focal lengths.
With image stabilization disabled, I could hand-hold most shots for at most 1/focal length, but missed or fudged a few shots even there.
In summary, I found the image stabilization of the XC 16‑50 and XF 18‑55 good for about two stops, and the XF 18‑135 stable for a full four stops over my personal hand-holding skills. Some of that stability is no doubt due to the higher weight of the XF 18‑135, but nevertheless, I find these results astonishing!
Close Focus Distance and Magnification
And now, the darling of all photographers: out-of-focus backgrounds. Common wisdom is that the bigger the aperture, the more the background is thrown out of focus. But that's only part of the truth, and honestly, not the most interesting part for these kinds of limited-aperture lenses. Much more powerful is getting closer to your subject: The closer you focus, and the farther away your background, the more the background will be out of focus. This effect gets even stronger when you zoom in.
The XC 16‑50 focuses much more closely than any other lens in this list, at 12 and 30 cm (Fuji says 15 cm). You can get really nice background separation with this lens, and great magnification in your macro shots. The XF 18‑55 focuses at 25 and 35 cm (Fuji: 40 cm), which is not particularly impressive. The XF 18‑135 focuses even farther, at 33 and 43 cm (Fuji: 45 cm), but gains magnification through its long tele zoom. The XF 27 is not optimized for this kind of thing at all, at 29 cm (Fuji: 34 cm).
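The claim that closeness beats aperture can be checked with a back-of-envelope thin-lens calculation: for a background at infinity, the blur-disk diameter on the sensor is simply the aperture diameter times the subject magnification. The focal lengths, f-numbers, and distances below are made-up illustrative values, not measurements of these lenses:

```python
def background_blur(f_mm, f_number, subject_m):
    """Blur-disk diameter (mm on the sensor) of a background at infinity,
    in the thin-lens approximation: aperture diameter times magnification."""
    aperture = f_mm / f_number                        # entrance pupil diameter
    magnification = f_mm / (subject_m * 1000 - f_mm)  # thin-lens magnification
    return aperture * magnification
```

For example, a 50 mm lens at f/5.6 focused at 0.3 m renders an infinite background blurrier than the same lens at f/2.8 focused at 0.6 m: halving the subject distance more than makes up for two full stops of aperture.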
To me, the XC 16‑50 is the winner for a small and light zoom kit. It might be the weakest option optically, but the differences are not dramatic at all, and it is the cheapest, smallest, and lightest lens, with the most useful wide end and the closest focusing. But it lacks a dedicated aperture ring and is built of plastic instead of metal, which does detract from the haptic joy somewhat.
The XF 18‑55 is optically the strongest lens. It might even beat the XF 27 prime on its own turf! But the optical differences to the cheaper XC 16‑50 and the more versatile XF 18‑135 are quite small, and are not worth the price, weight, and inconvenience to me.
The XF 18‑135 is really surprisingly good. The much longer focal range necessarily comes with compromises in optical quality and bulk, but it seems no significant corners were cut in this case. And the image stabilization is a significant step above the other two lenses. Considering that this lens usually replaces at least two other lenses, I even find the price reasonable. This is my first choice as a do-everything zoom kit.
The XF 27 is not very strong in any particular way, except size. And that size trumps all. If I just want to throw a camera in my bag without any particular photographic intentions, the XF 27 is my first choice. And possibly the XF 18, if I still have room in my bag.
As some small buying advice: the XC 16‑50 was refreshed in 2015 with the OIS II version, which introduced that nice close focusing distance (highly recommended). The XF 18‑135 was apparently built in two batches: an original made-in-China version that seemed to have horrible QA issues, and a second, made-in-Philippines version from 2017 that does not.
What I didn't mention
Aperture. The XF 18‑55 and XF 27 have a wider maximum aperture than the XC 16‑50 or XF 18‑135, by about two thirds of a stop. Shooting at bigger apertures makes brighter pictures with stronger background blur, and some loss in sharpness. I don't find the optical performance wide-open particularly interesting, because most of the time I'd use large apertures to blur the background, making sharpness and distortion mostly irrelevant. And as I said above, getting closer is usually more effective for background blur than maximum aperture, anyway.
Image stabilization. The three zooms offer optical image stabilization systems. From what I can tell, the XF 18‑135 is significantly more effective in this regard than the XC 16‑50 or the XF 18‑55. Hand-held shots with up to about 1/10th of a second seem easily achievable with the XF 18‑135, whereas the unstabilized XF 27 becomes blurry at 1/40th. Videos are noticeably smoother with the XF 18‑135 as well.
Weather sealing. The XF 18‑135 is weather sealed, the other lenses are not. My camera is not, so I don't care.
Distortion and Vignetting. Both are fixed in post. No need to obsess over them.
Autofocus speed. It is good. No need to obsess over it.