Basti's Scratchpad on the Internet

Posts tagged "science":

11 Sep 2019

How we perceive Speech

Now that I am officially a failed scientist, I might as well talk about my research in public. I spent the last few years analyzing speech recordings. Particularly, voiced speech, where vibrations in the vocal folds excite resonances in the vocal tract, and a tonal sound leaves our mouths and noses.

As humans, we are particularly tuned to recognizing these kinds of sounds. Even in loud background noise, even with dozens of people talking at the same time, we can clearly identify the sound of a human voice (even if we might not be able to understand the words).

spectrogram.png
Figure 1: A spectrogram of speech (it's me, saying: "Es war einmal ein Mann")

Looking at these sounds from a physical point of view, we can see that it is made up of a fundamental frequency at the voice's pitch, and harmonics at integer multiples of the fundamental. And even though the sound is clearly composed of multiple harmonics, we perceive it as a single sound with a single pitch. Even more perplexing, we attribute all of these harmonics to a single voice, even if they criss-cross with tonal sounds from different sources.

Yet, speech recognition systems regularly struggle with such tasks, unless we feed them unholy amounts of data and processing power. In other words, there has to be more to speech than the simple figure above indicates.

One area is definitely time resolution. Obviously, when the vocal folds open to admit a puff of air into the vocal tract, phases align, and loudness is higher than when the vocal folds have closed again and phases go out of sync. This happens several hundreds of times per second, at the frequency of the fundamental. Yet, this phase coherence is invisible in most of our visualizations, such as the spectrogram above, or the MFCCs usually used in speech recognition, as they are too coarse for such short-time detail.

An even more interesting detail emerges from fMRI scans of people who are speaking, and people who are listening to speech: their activation patterns are strikingly similar. As in, motor groups are activating when listening, just as if actual speech muscles were moved. To me, this indicates that when we listen to speech, we simulate speaking. And I find it highly likely that we understand speech mostly in terms of the movements we would have to make to imitate it. In other words, we do not internalize speech as an audio signal, but as muscle movements.

This matches another observation from a different area: When learning a foreign language, we can not hear what we can not produce. If you didn't learn how to speak an Ö and Ü (two German umlauts) as a child, you will have a hard time hearing the difference as an adult. Yet they sound completely distinct to me. In a production model, this makes a lot of sense, as we wouldn't know how to simulate a sound we can not produce.

Bringing this back to the science of signal processing, I believe that most speech analysis algorithms are currently lacking a production model of speech. Speech can not be fully understood as an audio signal. It needs to be understood in terms of the variables and limitations of a human vocal tract. I believe that if we integrated such a physiological production model into our machine learning models, we wouldn't need to feed them such vast amounts of data and electricity, and might even get by without them.

Tags: signal-processing science

Publish or Perish

As part of my PhD, I am supposed to publish three papers. So far, I have been unable to do so. But this is not about me, I will survive regardless. This is about the systems behind our papers' rejections. Because they are… bad. Political. Un-scientific.

Our first manuscript was submitted for publications, and got a middling review. If we wanted our work to be published, we were to expand on our introduction to mention the reviewers' favorite publications, and broaden our comparison to include their work. This is considered normal. In the second round of reviews we then got rejected, because our introduction was now too long, and our comparison too broad.

The reviews additionally claimed that "novelty cannot be claimed for something that is not validated and compared to state of the art" and that "[our work lacks] formal statistical evaluation of the estimation performance". Which is certainly true, but is also true of every other work published on the same topic in the last five years (I checked). We showed this evidence to the reviewers, but it was not even deemed worthy of a comment.

In hindsight however, we realized that we had included at least one reviewer's own algorithms in our comparison, and found it lacking. Their work had only ever been tested, publicly, with a single example recording, where it worked well. Our comparison did the same with twenty thousand recordings, which highlighted some issues. So our paper was rejected. Of course we can't be sure that this was ultimately true, as the reviewers' names are not disclosed to the reviewees (but certainly vice-versa).

Our next submission was to a different journal. This time, we had learned from our mistakes, and kept the scope of our investigation more minimal. There would be only a very small comparison, and we would be very careful not to step on anyone else's toes. The review was, again, negative.

This time, the grounds for rejection were lack of comparison to state of the art (not a winning move, see above), and our too high false negative rate. Additionally, it contained wonderful verbiage like:

The are many methods that are very similar to the presented method in the sense of being feature extraction methods operating in the STFT domain.

…which is just patently ridiculous. If being a "feature extraction method in the STFT domain" was grounds for rejection, there would be no publications in our area of research. And let's ignore for a minute that our publication was not, in fact, such a method.

Again, hindsight showed the real culprit: Our manuscript reported a high false negative rate of roughly 50%. Had we just not mentioned this, no one would have noticed. That is what everyone else is doing. More importantly however, reporting on false positive/negative rates in our evaluation called into question every other publication that hadn't. And we can't have that.

Another submission was liked because no one had done anything similar before, and was found to provide value to many researchers, but rejected because it still somehow "lacked novelty".

So, in summary, our first submission was rejected because it made one of the reviewers look bad, and the second because we not only wanted to report on our method's advantages, but also its shortcomings. Worse, in following the evidence where it lead, we had created new error measures that could potentially find flaws in existing publications, which could potentially make a whole lot of researchers look bad.

After five years of dealing with this, I am thoroughly disheartened. Instead of a system for disseminating knowledge, I have found the scientific publishing system a political outlet for the famous, and a lever for keeping uncomfortable knowledge suppressed. This is not the scientific world I want to live in, and apparently, it doesn't want me to live in it, either.