Neural Audio Coding

Speech/audio coding has traditionally involved substantial domain-specific knowledge such as speech generation models. If you haven't heard of this concept, don't worry, because you are probably using this technology in your everyday life, e.g., when you are on the phone, listening to music on your mobile device, watching television, etc. The goal of audio coding is to compress the input signal into a bitstream whose bitrate is, of course, smaller than the input's, and then to recover the original signal from that code. The reconstructed signal should be as perceptually similar as possible to the original one.
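To put rough numbers on that, here is a back-of-the-envelope calculation, assuming 16 kHz, 16-bit mono wideband speech as the uncompressed input; the 23.85 kbps operating point is one of those used in the samples below, and the rest is just arithmetic, not a result from the paper:

```python
# Rough coding-gain arithmetic, assuming 16 kHz, 16-bit mono PCM input.
raw_bitrate = 16000 * 16        # 256,000 bits per second, uncompressed
coded_bitrate = 23850           # e.g., the 23.85 kbps operating point used below
print(f"compression ratio: {raw_bitrate / coded_bitrate:.1f}x")  # ~10.7x
```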

We've long wondered if there's a data-driven alternative to the traditional coders. Don't get me wrong, though: our goal is to improve the coding efficiency with the help of deep learning, while we may need to keep some of the helpful techniques from the traditional coders. We are just curious as to how far we can go with a deep learning-based approach.

Our initial thought was simple: a bottleneck-looking autoencoder will do the job, as it reduces the dimensionality in the code layer. The thing is, dimension reduction alone doesn't guarantee a good amount of compression if each dimension still has to be represented with too many bits; the code has to be binary. A counterexample would be a case where the number of hidden units (or features) in the code layer is larger than the input, while the coding gain could still be high since each feature is encoded by only one bit. In other words, the encoder part of the autoencoder has to generate binarized features, not the regular real-valued ones. The binary features complicate backpropagation due to the non-differentiable nature of the discreteness, but this part wasn't a big deal thanks to existing solutions to this problem, such as softmax quantization [1].
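For the curious, here is a minimal PyTorch sketch of that kind of soft-to-hard (softmax) quantization. It only illustrates the idea, not the paper's exact module; the number of levels and the softmax hardness `alpha` are made-up values (with two levels, each feature costs exactly one bit):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxQuantizer(nn.Module):
    """Soft-to-hard quantization of the code features.

    During training, each code value is replaced by a softmax-weighted mixture of
    learnable quantization levels, which keeps everything differentiable; at test
    time the nearest level is picked, so each feature costs log2(num_levels) bits
    (one bit per feature when num_levels == 2).
    """
    def __init__(self, num_levels=2, alpha=10.0):
        super().__init__()
        self.levels = nn.Parameter(torch.linspace(-1.0, 1.0, num_levels))
        self.alpha = alpha  # controls how "hard" the soft assignment is

    def forward(self, z):
        # z: (batch, channels, time) real-valued code from the encoder
        dist = (z.unsqueeze(-1) - self.levels) ** 2          # distance to each level
        if self.training:
            weights = F.softmax(-self.alpha * dist, dim=-1)  # soft, differentiable
            return (weights * self.levels).sum(dim=-1)
        return self.levels[dist.argmin(dim=-1)]              # hard, discrete
```

One common trick with this family of quantizers (not necessarily what the paper does) is to anneal `alpha` during training so the soft assignment gradually approaches the hard one used at test time.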

We ended up with an end-to-end framework: a 1-D CNN that takes time-domain samples as the input and produces time-domain samples as the output. No MDCT or filter banks. We love it, as it doesn't involve any complicated windowing techniques and their frequency responses, overlap-and-add, adaptive windowing to deal with transient periods, etc.
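As a rough idea of what such a model looks like, here is a tiny 1-D CNN autoencoder on raw waveform frames. The layer counts, channel sizes, and kernel widths are placeholders rather than the paper's configuration, and the quantizer sketched above would sit between the encoder and the decoder:

```python
import torch
import torch.nn as nn

class TimeDomainAE(nn.Module):
    """A minimal 1-D CNN autoencoder on time-domain frames (illustrative sizes)."""
    def __init__(self, code_channels=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, code_channels, kernel_size=9, padding=4),  # real-valued code
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(code_channels, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=9, padding=4),  # back to waveform samples
        )

    def forward(self, x):           # x: (batch, 1, frame_length) raw samples
        code = self.encoder(x)      # quantization would happen here
        return self.decoder(code)
```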

Our key observation in this project was that interconnecting multiple models is important. But how? We borrowed an idea from old-school signal processing: residual coding. I know the idea has already been revived in the ResNet architecture, whose identity shortcuts between layers let each stack of layers model the residual between its output and its input. In our system, residual coding is implemented among cascaded neural networks, too. What that means is that our coding system consists of multiple autoencoders connected sequentially: the first one does its best to recover the input, but since it's not perfect, it leaves a residual signal, i.e., the input minus its output. That residual is fed to the second autoencoder, which tries to recover the first one's residual as well as it can. Once again, since the second one isn't perfect either, it creates its own residual, which is taken care of by the third one, and so on. The figure below shows how the sound quality improves as more autoencoding modules are added.
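To make the data flow concrete, here is a sketch of the cascade at inference time, reusing the hypothetical `TimeDomainAE` from above; the final output is simply the sum of what all modules reconstruct:

```python
import torch

def cascaded_residual_coding(x, autoencoders):
    """Each module codes whatever error the previous modules left behind."""
    residual = x
    reconstruction = torch.zeros_like(x)
    for ae in autoencoders:
        recon_i = ae(residual)            # this module's best guess at the residual
        reconstruction = reconstruction + recon_i
        residual = residual - recon_i     # what is still left for the next module
    return reconstruction

# e.g., a three-module cascade on a batch of 512-sample frames
modules = [TimeDomainAE() for _ in range(3)]
x = torch.randn(8, 1, 512)
x_hat = cascaded_residual_coding(x, modules)
```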

I know it's a bit more complicated than I explained above, because it turned out that naïvely training a new autoencoder for each residual isn't the best way: it's too greedy. For example, if one autoencoder screws up, the next one carries the whole burden, while the size of the code keeps growing linearly. So the greedy approach now just serves as our fancy initializer, and on top of it we employ a finetuning step that improves the overall autoencoding quality by doing backprop through all modules at once. The overall architecture of the cascaded autoencoders for residual coding is shown below.
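Sketched in code, that training schedule might look like the following; the optimizer choices, learning rates, and epoch counts are placeholders, and `cascaded_residual_coding` is the helper from the previous sketch:

```python
import torch

def greedy_then_finetune(batches, autoencoders, greedy_epochs=10, finetune_epochs=10):
    mse = torch.nn.MSELoss()

    # Stage 1 (greedy, used as the fancy initializer): train module i on the
    # residual left behind by the already-trained modules 0..i-1.
    for i, ae in enumerate(autoencoders):
        opt = torch.optim.Adam(ae.parameters(), lr=1e-4)
        for _ in range(greedy_epochs):
            for x in batches:
                with torch.no_grad():
                    target = x
                    for prev in autoencoders[:i]:
                        target = target - prev(target)  # residual reaching module i
                loss = mse(ae(target), target)
                opt.zero_grad()
                loss.backward()
                opt.step()

    # Stage 2 (finetuning): backprop through the whole cascade at once.
    params = [p for ae in autoencoders for p in ae.parameters()]
    opt = torch.optim.Adam(params, lr=1e-5)
    for _ in range(finetune_epochs):
        for x in batches:
            loss = mse(cascaded_residual_coding(x, autoencoders), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
```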

What's even more fun is that having a linear predictive coding (LPC) block turned out to help the perceptual quality a lot. Since we see LPC as a traditional kind of autoencoding, it's just another module, the 0th one, in our cascaded architecture!
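Here is a rough sketch of what that 0th module could look like, using librosa's LPC routine; the order and frame length are arbitrary choices, and in a real codec the LPC coefficients themselves would also have to be quantized and transmitted:

```python
import numpy as np
import librosa
import scipy.signal

def lpc_analysis(frame, order=16):
    a = librosa.lpc(frame, order=order)                 # [1, a_1, ..., a_order]
    residual = scipy.signal.lfilter(a, [1.0], frame)    # prediction error (excitation)
    return a, residual

def lpc_synthesis(a, residual_hat):
    # All-pole synthesis filter: turns the decoded residual back into a waveform.
    return scipy.signal.lfilter([1.0], a, residual_hat)

frame = np.random.randn(512)           # stand-in for one speech frame
a, res = lpc_analysis(frame)
# ... `res` is what the cascaded neural autoencoders would then code ...
recon = lpc_synthesis(a, res)          # ~= frame when the residual is lossless
```

The appeal of this arrangement, as I understand it, is that the neural modules only need to code the spectrally flattened LPC residual rather than the raw waveform.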

Another thing to note is that we cared a lot about model complexity when we designed the system, so that inference at test time (both encoding and decoding) stays cheap and the memory footprint small. See the comparison with the other codecs.
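As a sanity check on the footprint side, counting parameters of a module is a one-liner; here it is applied to the hypothetical `TimeDomainAE` above (the real models and numbers are in the paper):

```python
model = TimeDomainAE()
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters, ~{num_params * 4 / 2**20:.2f} MiB as float32")
```

The audio samples below compare AMR-WB and the proposed method at matched bitrates: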

Sample 1: Reference (ground truth)
Sample 1: AMR-WB 23.85 kbps
Sample 1: the proposed method 23.85 kbps
Sample 2: Reference (ground truth)
Sample 2: AMR-WB 19.85 kbps
Sample 2: the proposed method 19.85 kbps
Sample 3: Reference (ground truth)
Sample 3: AMR-WB 15.85 kbps
Sample 3: the proposed method 15.85 kbps
Sample 4: Reference (ground truth)
Sample 4: AMR-WB 8.85 kbps
Sample 4: the proposed method 8.85 kbps

Please check out our paper [2] about this project for more details.

References

1. Srihari Kankanahalli, “End-to-End Optimized Speech Coding with Deep Neural Networks,” In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
2. Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, and Minje Kim, “Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding,” In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), Graz, Austria, September 15-19, 2019 [pdf]