15 Oct 2023

Two Years with Legacy Code

From January 2021 to the beginning of 2023, I worked on a legacy code base at Fraunhofer IDMT in Oldenburg. My task was the maintenance and development of a DNN-based speech recognition engine that had become terra incognita when its original developer had left the company a year before I started. The code had all the hallmarks of severe technical debt, with layers of half-used abstractions, many unused branches of unknown utility, and the handwriting of several concurrent programmers at odds with each other.

The code had evidently been written in a mad dash to bring the product to market. To its developers' credit, it had nonetheless been in production for several years, with a core of robust algorithms surrounded by helper scripts that had allowed the company to keep building on it, even after the original developers had left.

It was my job to clean it up. Having recently spent six years on my PhD, I welcomed the calmer waters of 'just' programming for a bit. This blog post is a summary of the sorts of challenges I faced during this time, and the kinds of techniques that helped me overcome them.

The lay of the land

I approached the task from the outside, sorting through the build scripts first. Evidently, at least three authors were involved: one old-school Unix geek who wrote an outdated dialect of CMake, one high-level Python scripter, and one shell scripter who deeply believed in abstraction-by-information-hiding. The result of this was… interesting.

For a good few weeks I "disassembled" these scripts by tracing their execution manually through their many layers, and writing down the necessary steps that were actually executed. My favorite piece of code was a Makefile that called a shell script that ran a Python program, which instantiated a few classes and data structures, which ultimately executed "configure; make; make install" on another underlying Makefile. I derived great satisfaction from cutting out all of these middle-men, and consolidating several directories of scripts into a single Makefile.

Similar simplifications were implemented at the same time across several code bases by my colleagues. In due time, this concerted effort enabled us to implement continuous integration, automated benchmarking, and automated builds, but more on that later.

Data refactoring

The speech recognition software implemented a sort of interpreter for the DNN layers, originally encoded as a custom binary blob. Apparently, a custom binary approach had been taken to avoid dependencies on external parsing libraries. Yet the data had become so convoluted that both its compilation and its parsing were now considered unchangeable black boxes that impeded further development.

Again, I traced through the execution of the compiling code, noted down the pieces of data it recorded, and rewrote the compiler to produce a MsgPack file. On the parsing side, I wrote a custom MsgPack parser in C. Looking back, every job I've had involved writing at least a couple of data dumpers/parsers, yet many developers seem intimidated by such tasks. But why write such a thing yourself instead of using an off-the-shelf solution? In an unrelated code review later in the year, one colleague used the cJSON library for parsing JSON; in the event, cJSON was several orders of magnitude bigger and more complex than the code base it was serving, which is clearly absurd. Our job as developers is to manage complexity, including that of our dependencies. In cases such as these, I often find a simple, fit-for-purpose solution preferable to more generalized external libraries.
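To give a sense of how small a fit-for-purpose parser can be, here is a minimal sketch of a MsgPack decoder. It is written in Python for brevity (the parser described above was written in C), and it handles only a small subset of the format: nil, booleans, small integers, floats, short strings, and short arrays and maps. The function name `unpack` and the example payload are illustrative assumptions, not the engine's actual data layout.

```python
import struct

# Fixed-width types in this subset: tag byte -> struct format (big-endian).
FIXED = {0xCA: ">f", 0xCB: ">d", 0xCC: ">B", 0xCD: ">H", 0xCE: ">I"}
CONSTANTS = {0xC0: None, 0xC2: False, 0xC3: True}


def unpack(buf, pos=0):
    """Decode one MsgPack value from buf at pos; return (value, next_pos)."""
    tag = buf[pos]
    pos += 1
    if tag <= 0x7F:                       # positive fixint
        return tag, pos
    if tag <= 0x8F:                       # fixmap, up to 15 key/value pairs
        result = {}
        for _ in range(tag & 0x0F):
            key, pos = unpack(buf, pos)
            value, pos = unpack(buf, pos)
            result[key] = value
        return result, pos
    if tag <= 0x9F:                       # fixarray, up to 15 items
        items = []
        for _ in range(tag & 0x0F):
            item, pos = unpack(buf, pos)
            items.append(item)
        return items, pos
    if tag <= 0xBF:                       # fixstr, up to 31 bytes of UTF-8
        length = tag & 0x1F
        return buf[pos:pos + length].decode("utf-8"), pos + length
    if tag in CONSTANTS:                  # nil, false, true
        return CONSTANTS[tag], pos
    if tag in FIXED:                      # float32/64, uint8/16/32
        fmt = FIXED[tag]
        (value,) = struct.unpack_from(fmt, buf, pos)
        return value, pos + struct.calcsize(fmt)
    if tag >= 0xE0:                       # negative fixint (-32..-1)
        return tag - 0x100, pos
    raise ValueError(f"unsupported MsgPack tag 0x{tag:02x}")


# Hypothetical payload: {"layers": [3, -1, 1.5], "ok": True}
blob = b"\x82\xa6layers\x93\x03\xff\xca\x3f\xc0\x00\x00\xa2ok\xc3"
model, end = unpack(blob)
```

Anything outside the subset raises an error instead of being silently skipped, which is usually what you want when you control both the writer and the reader.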

A part of the DNN data came from the output of a training program. This output, however, was eternally unstable, often breaking unpredictably between versions, and requiring complex workarounds to accommodate different versions of the program. The previous solution to this was a deeply nested decision tree for the various permutations the data could take. I simplified this code tremendously by calling directly into the other program's libraries, instead of trying to make sense of its output. This is another technique I had to rely on several times, hooking into C/C++ libraries from various Python scripts to bridge between data in a polyglot environment.
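As a rough illustration of calling into a C library from Python, the sketch below uses the standard `ctypes` module. Since the training program's libraries are not described here, it stands in the C standard library's `strlen` as the foreign function; it assumes a POSIX system, and the wrapper name `c_strlen` is made up for the example.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If find_library() returns None, CDLL(None)
# falls back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature explicitly: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: bytes) -> int:
    """Return the length of a byte string as computed by the C library."""
    return libc.strlen(text)
```

Declaring `argtypes` and `restype` up front matters: without them, ctypes assumes `int` everywhere, which can silently truncate pointers or sizes on 64-bit platforms.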

Doing these deep dives into data structures often revealed unintended entanglements. In order to assemble one data structure, you had to grab pieces of multiple different source data. Interestingly, once data structures were cleaned up to no longer have such entanglements, algorithms seemed to fall into place effortlessly. However, this was not a one-step process, but instead an ongoing struggle to keep data structures minimal and orthogonal. While algorithms and functions often feel easier to refactor than data structures, I have learned from this that it is often the changes to data structures that have the greatest effect, and should therefore receive the greatest scrutiny.

Code refactoring

My predecessor had left me a few screen casts by way of documentation. While the core program was reasonably well-structured, it was embedded in an architectural curiosity that told the tale of a frustrated high-level programmer forced to do low-level gruntwork. There were poor-man's-classes implemented as C structs with function pointers, there were do-while-with-goto-loops for exception handling, there were sort-of-dynamically-typed data containers, accompanied by angry comments decrying the stupidity of C.

Now I like my high-level programming as much as the next guy, but forcing C to be something it isn't is not my idea of fun. So over a few months I slowly removed most of these abstractions. Somewhat to my surprise, most of them turned out to be pure overhead that could simply be removed. Where a replacement was needed, I reverted to native C constructs: tagged unions instead of casting, variable-length arrays instead of dynamic arrays, and structs treated as values instead of references. This alone reduced the entire code base by a good 10%. The harder part was sorting out the jumble of headers and dependencies that had evidently built up over time. Together with the removal of dead code paths, the overall code base shrank by almost half. There are few things more satisfying than excising and deleting unnecessary code.

I stumbled upon one particularly interesting problem when trying to integrate another code base into ours. Within our own software, build times were small enough to make logging and printf-debugging easier than an interactive debugger such as GDB. The other code base, however, was too complex to recompile on a whim, and a different solution had to be found. Now I am a weird person who likes to touch the raw command line instead of an IDE. And in this case this turned out to be a huge blessing, as I found that GDB can not only be used interactively, but can also be scripted! So instead of putting logging into the other library, I wrote GDB scripts that augmented breakpoints with a little "call printf(...)" or "print/d X". These could get surprisingly complicated, where one breakpoint might enable or disable other breakpoints conditionally, and breakpoint conditions could call functions of their own. It took some learning, but these debugging scripts were incredibly powerful, and a technique I will definitely return to in the future.
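Such scripts are plain text files of GDB commands. As a rough illustration of the "logging breakpoint" idea, the Python helper below generates the commands for one breakpoint that prints some values and then continues without stopping. The helper name `logging_breakpoint`, the source location `decoder.c:120`, and the variables `frame` and `score` are all hypothetical.

```python
def logging_breakpoint(location, fmt, *exprs, condition=None):
    """Return GDB commands for a breakpoint that prints values and continues.

    location  -- where to break, e.g. "file.c:LINE" or a function name
    fmt       -- printf-style format string (a newline is appended)
    exprs     -- expressions evaluated in the debugged program
    condition -- optional expression; the breakpoint fires only if it is true
    """
    head = f"break {location}"
    if condition:
        head += f" if {condition}"
    args = "".join(f", {expr}" for expr in exprs)
    return "\n".join([
        head,
        "commands",                        # attach commands to that breakpoint
        "  silent",                        # suppress GDB's usual stop message
        f'  printf "{fmt}\\n"{args}',      # log instead of stopping
        "  continue",                      # resume the program automatically
        "end",
    ])


script = logging_breakpoint(
    "decoder.c:120", "frame=%d score=%f", "frame", "score",
    condition="frame > 10",
)
print(script)
```

The generated text can be saved to a file and loaded with `gdb -x`, so the library under inspection never has to be modified or recompiled.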

When adding new features to the software, I often found it impossible to work the required data flow into the existing program code without snowballing complexity. I usually took these situations as code smells that called for a refactoring. Invariably, each cleanup of program flow or data structures inched the program closer to accommodating my feature addition. After a while, this became an established modus operandi: independently clean the code until feature additions become easy and obvious, then do the obvious thing. Thus every task I finished also left the surrounding code in a better state. In the end, about 80% of the code base had gotten this treatment, and I strongly believe that this has left the project in a much better state than it was before. To say nothing of the added documentation and tests, of course.

More velocity makes bigger craters

As I slowly shifted from cleanup work to new features, change management became a pressing issue. New features had to be evaluated, existing features had to be tested, and changes had to be documented and downstreamed. Fascinatingly, the continuous integration and evaluation tools we built for this purpose soon unearthed a number of hidden problems in other parts of the product that we had not been aware of (including that the main task I had been hired to do was less worthwhile than thought, LOL). That taught us all a valuable lesson about testing, and proving our assertions. That said, I never found bottom-level unit tests all that useful for our purposes; the truly useful tests invariably were higher-level integration tests.

Eventually, my feature additions led to downstream changes by several other developers. While I took great care to present a stable API, and to document all changes and behavior appropriately, at the end of the day my changes still amounted to a sizeable chunk of work for others. This was a particularly stark contrast to the previous years of perfect stagnation while nobody had maintained the library. My main objective at this point was to avoid the mess I had started out with, where changes had evidently piled on changes until the whole lot had become unmaintainable.

Thus a balance had to be struck between moving fast (and breaking things), and projecting stability and dependability. One crucial tool for this job turned out to be code reviews. By involving team members directly with the code in question, they could be made more aware of its constraints and edge cases. It took a few months to truly establish the practice, but by the end of a year everyone had clearly found great value in code reviews as a tool for communication.

Conclusions

There is a lot more to be said about my time at Fraunhofer. The deep dive into the world of DNN engines was truly fascinating, as were the varied challenges of implementing these things on diverse platforms such as high-performance CPU servers, laptops, Raspberry Pis, and embedded DSPs. I learned to value the automation of developer tasks, and the importance of interface stability and documentation for developer productivity.

But most of all, I learned to appreciate legacy code. It would have been easy to call it a "mess", and advocate rewriting it from scratch. But I found it much more interesting to try to understand the code's heritage, and tease out the algorithmic core from the abstractions and architectural supports. There were many gems to be found this way, and a lot to be learned from the programmers who came before me. I often felt a strange connection to my predecessor, as if we were talking to each other through this code base. And no doubt my successor feels the same way about my code now.