Learning to Hash for Source Separation

We care a lot about the efficiency of the machine learning inference process. As part of this effort, we recently came up with a hash code-based source separation system, where we use a specially designed hash function to improve both the separation performance and the efficiency of the system.

KNN Source Separation

Our idea starts from a nearest neighbor search. Imagine you have a set of mixture spectra whose target clean sources are already known. From them, we can construct a rather exhaustive dictionary of mixture spectra in the frequency domain (the row vectors of the pink matrix in the figure below).

A naïve KNN search-based source separation.

For a given test spectrum \bm{x}, this naïve separation system first finds the few most similar spectra, e.g., by computing inner products between \bm{x} and all the row vectors of \bm{H}. Since \bm{H} is a large dictionary, hopefully a few items are similar enough. If the search is successful, the rest of the job is a done deal, because we already know the corresponding source spectra of all the dictionary items. We can simply collect the source vectors corresponding to the KNN dictionary items and average them out [1].
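To make the procedure concrete, here is a minimal numpy sketch of this naïve KNN separation for a single frame (the function and variable names are ours, and averaging the neighbors' IBMs into a soft mask is one way to realize the averaging step):

```python
import numpy as np

def knn_separate(x, H, masks, k=5):
    """Naive KNN source separation for one mixture frame.

    x:     (F,) test mixture magnitude spectrum
    H:     (N, F) dictionary of known mixture spectra (row vectors)
    masks: (N, F) ideal binary masks paired with the rows of H
    """
    scores = H @ x                      # inner products with all dictionary rows
    nearest = np.argsort(scores)[-k:]   # indices of the k most similar items
    mask = masks[nearest].mean(axis=0)  # average the neighbors' IBMs into a soft mask
    return mask * x                     # masked (denoised) spectrum
```

Note that this is exactly the expensive part: `H @ x` touches every row of the dictionary for every incoming frame.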

Although this naïve idea works quite well (we could get about a 9.66 dB improvement in terms of signal-to-noise ratio), we found it disturbing, because running the KNN search for every incoming test spectrum is unrealistically expensive.

A hashing-based speed-up!

That’s why we need a hashing technique. As a function, hashing converts an input multi-dimensional vector \bm{x} into a bitstring \mathbf{x}: \mathbf{x}\leftarrow\bm{\phi}(\bm{x}). So what? Comparing two bitstrings boils down to a simple Hamming distance calculation, which only needs bitwise operations (see the figure below for an example).

To compute the Hamming distance, each bit pair is compared using the bitwise XOR operation, which flags the disagreeing positions; counting the “on” bits of the result gives the Hamming distance. Conversely, counting the agreeing bits gives the Hamming similarity. Either way, everything reduces to cheap bitwise operations.
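As a small illustration of how cheap this comparison is, here is a minimal sketch (names are ours) using Python's integer bit operations:

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR leaves a 1 exactly where the two bitstrings disagree;
    # counting those "on" bits yields the Hamming distance.
    return bin(a ^ b).count("1")

def hamming_similarity(a: int, b: int, n_bits: int) -> int:
    # Agreement is just the total number of bits minus the disagreements.
    return n_bits - hamming_distance(a, b)
```

For example, `hamming_distance(0b1011, 0b1001)` is 1, since the two codes disagree in a single position.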

For example, in locality sensitive hashing (LSH) [2][3], the hash function randomly generates a projection matrix \bm{P}, multiplies it with the input vector, and takes the sign of the projected coefficients: \text{sign}(\bm{P}\bm{x}). Hence, if there are L rows in \bm{P}, we get L bipolar binary values [4]. For example, if we start from 513-dimensional Fourier magnitude coefficients \bm{x}\in\mathbb{R}^{513} and use \bm{P}\in\mathbb{R}^{150\times 513}, we can reach an 8.75 dB improvement. That is quite a good reduction, from 513 floating-point values down to 150 binary values, in terms of both spatial and computational complexity.
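In code, the whole LSH hash function is essentially a one-liner over a random matrix. Here is a minimal numpy sketch, with dimensions mirroring the 513-dim spectra and L=150 bits above:

```python
import numpy as np

def lsh_hash(x, P):
    # Project, then keep only the signs: L bipolar binary values.
    return np.sign(P @ x)

rng = np.random.default_rng(0)
F, L = 513, 150                  # input dimensionality and code length
P = rng.standard_normal((L, F))  # random projection matrix
x = rng.random(F)                # stand-in for a magnitude spectrum
code = lsh_hash(x, P)            # 150 values in {-1, +1}
```

(With random floats, a projected coefficient is essentially never exactly zero, so `np.sign` returning 0 is a negligible edge case.)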

Now the KNN search is done efficiently in the hash code space.

All this hashing-related argument is cool, but we wanted to do better. While simple and effective, LSH might not be the best hash function for our problem, since it’s purely based on a bunch of random projections. What if there is another hash function that works as well as this L=150 LSH case, but with a much smaller L, such as only 25 bits?

Limitations of LSH

Let’s revisit the LSH algorithm to understand its limitations. Again, LSH is based on random projections, which means that for a pair of very close examples, it’s hard for them to land on opposite sides (positive and negative) after a projection, as illustrated in the following figure.

After the projection onto the blue projection vector, whose origin is denoted by the dark grey line (its separating hyperplane), the pair of examples are on the same side. It means that the binary code generated from this projection will be -1 for both examples.
Even if we repeat the random projection multiple times, due to the proximity, it’s hard for the pair to end up on opposite sides. The newly added black lines are other potential projection vectors, which all result in the same sign after projection. In other words, they all fail to separate the pair.

We’d like to call these random projections “a drunk knight’s swinging experiment.” Imagine a pair of apples hanging in the air, but only an inch apart. It might be difficult for the drunk knight to swing a sword in-between the apples, ’cause HE IS DRUNK! That’s why LSH produces similar bitstrings for originally similar input vectors.

On the other hand, LSH CAN create discriminative bitstrings when the apples are far away. In other words, LSH projections are more likely to generate differently signed bits if the compared examples are far from each other in the original data space. See the example below.

The exactly same set of projections can all “separate” the pair, and create distinctive bitstrings.

Boosted LSH

To address this, we developed a new LSH algorithm, called Boosted LSH (BLSH), where we introduce the AdaBoost algorithm to train the projections. With boosting, we can learn the projection vectors one by one, in the order of their importance. Let’s start by learning the first projection vector. Even though it gives us just one sign per example, it should best represent the similarity and dissimilarity of all the spectra. The learning is relatively simple, because what we eventually learn is a projection vector followed by a nonlinearity, i.e., the sign function. Ring a bell? It’s just a perceptron! For the next one, according to the boosting scheme, we will focus more on the examples that are misrepresented by the first code. By doing so, whenever we add a new perceptron, or weak learner, it makes up for the mistakes the previously learned weak learners have made so far.

A nonlinear classification problem where the ideal decision hyperplane should be a large circle in-between the outer circle and the inner circle. By adding 20 weak learners, we are not quite there yet. The size of the blue dots and yellow crosses represents their importance during the weak learner-training process. Since they are all quite large, they are still misclassified.
After combining 500 weak learners, we can solve this nonlinear classification problem. Note that, after 500 weak learners, no examples are significantly misclassified, so the markers are all small.

Our challenge is that we are not doing classification but KNN source separation, so we cannot use a “misclassification” error to pay different amounts of attention to the examples. Instead, we use the pairwise similarity between the original spectra as our target: for a given pair of spectra in the \bm{H} matrix, if the pairwise Hamming similarity of the so-far-learned hash codes is off from the original similarity, we give a large weight to that pair to instruct the next perceptron to pay attention to it. Here’s the target of our training, the self-similarity matrix (SSM):

Simple pairwise cosine similarity among all mixture spectra pairs serves as our training target.
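Putting the pieces together, the outer boosting loop over this SSM target might look like the following numpy sketch. This is a simplification of the paper's recipe; in particular, `train_one_perceptron` is a hypothetical helper standing in for fitting one projection vector under the current pair-weight matrix:

```python
import numpy as np

def cosine_ssm(X):
    # Pairwise cosine similarity among all rows: the training target.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def blsh_train(X, n_bits, train_one_perceptron):
    """Sketch of the BLSH boosting loop over training spectra X, (N, F)."""
    N = X.shape[0]
    target = cosine_ssm(X)
    W = np.ones((N, N)) / N**2               # start with uniform pair weights
    P = []
    for _ in range(n_bits):
        p = train_one_perceptron(X, W)       # next weak learner (hypothetical helper)
        P.append(p)
        codes = np.sign(X @ np.array(P).T)   # bipolar codes so far, (N, l)
        pred = codes @ codes.T / len(P)      # normalized Hamming-style similarity
        err = np.abs(target - pred)          # pairs the codes still get wrong...
        W = err / err.sum()                  # ...get more weight for the next learner
    return np.array(P)                       # (n_bits, F) projection vectors
```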

Given this ground-truth SSM, our first perceptron gives us a binary value per example, which, as a feature, should try to catch up with the shape of the GT SSM. The first hash code we learn from the first perceptron indeed recovers some of the large structures in the GT SSM:

The recovered SSM from the single-bit hash code.

When we train this first perceptron, we repeatedly perform a prediction using the current perceptron weights (i.e., the projection vector), and the predicted bipolar binary values construct the predicted SSM. The predicted SSM is compared against the ground-truth SSM to compute the loss, which is backpropagated to update the perceptron.

The comparison of the SSMs to compute the loss.
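This training step can be sketched as follows. To keep the loss differentiable, we use a tanh relaxation of the sign function; that relaxation, and the plain gradient-descent recipe, are our assumptions for illustration and may differ from the paper's exact training:

```python
import numpy as np

def train_perceptron(X, target, weights, lr=0.1, steps=500, seed=0):
    """Fit one projection vector p so that the SSM of tanh(X @ p)
    matches the target SSM under the given pair weights.

    X: (N, F) spectra; target, weights: (N, N) symmetric matrices.
    """
    rng = np.random.default_rng(seed)
    p = rng.standard_normal(X.shape[1]) * 0.01
    for _ in range(steps):
        b = np.tanh(X @ p)               # relaxed bipolar bits, (N,)
        S = np.outer(b, b)               # predicted SSM from this single bit
        G = -2 * weights * (target - S)  # dLoss/dS of the weighted squared error
        grad = X.T @ ((2 * G @ b) * (1 - b**2))  # chain rule through outer() and tanh
        p -= lr * grad
    return p
```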

For the following perceptrons, whenever we add more of them to the pool of weak learners, they gradually improve the performance, meaning the recovered SSM looks more and more similar to the original. After the 5th perceptron, for example, it looks like this:

It’s still rough but quite similar now.

Learning the l-th perceptron is a little different from training the very first one, as we need to give different weights to the “misclassified” examples. In our case, by comparing the SSMs, we can identify the SSM elements that are not doing well. It’s of course different from the “misclassification” concept, but equally effective for our boosting algorithm. Then, when we compute the loss, we element-wise multiply this weight matrix as follows:

The rightmost weight matrix gives more weight to the details that the so-far-trained binary SSM (the leftmost one) has missed.
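As a sketch of this reweighting step (our naming; the paper may normalize differently), the weight matrix can be derived directly from the element-wise SSM error and then multiplied into the squared-error loss:

```python
import numpy as np

def update_pair_weights(target_ssm, predicted_ssm):
    # SSM elements the codes-so-far get wrong receive larger weights,
    # steering the next perceptron toward those pairs.
    err = np.abs(target_ssm - predicted_ssm)
    return err / err.sum()

def weighted_ssm_loss(target_ssm, predicted_ssm, W):
    # Element-wise multiply the weight matrix into the squared error.
    return float((W * (target_ssm - predicted_ssm) ** 2).sum())
```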

Finally, below is the binary SSM we compute after learning 100 projections, side-by-side with the ground truth:

Looks very similar to me.

Experimental results

Let’s take a look at this graph that summarizes the behavior of our hashing-based source separation model.

First off, all the models enjoy improved separation performance (in terms of signal-to-distortion ratio) with longer hash codes (L denotes the number of bits). But the proposed BLSH methods (blue and green) consistently outperform the ordinary random projection-based LSH (red and purple). Another factor is the size of the \bm{H} matrix. If we use only 1% of the available mixture spectra for our dictionary, the random projection-based LSH suffers from low performance and increased uncertainty (purple): depending on how the random projection matrix is sampled, the performance varies significantly (see the large confidence intervals). On the other hand, although BLSH’s separation quality also degrades when we use too few examples (1%), its confidence interval is much narrower (green), showing that BLSH is free from the performance fluctuation that LSH suffers from.

In the table below, we compare BLSH with various other setups, including an open case (where the test signals’ noise sources are not known to the hashing algorithm). Surprisingly, BLSH outperforms the STFT-based baseline, which means that BLSH succeeded in learning better features than the raw Fourier coefficients. It’s still far from deep learning models’ performance, but considering its efficiency, BLSH is promising for source separation in resource-constrained environments.

The results are from the case when L=150.

Audio examples

In the following examples, we can see the merit of BLSH over LSH: BLSH uses a bit length of only L=25, while LSH has to use 125 bits to achieve similar-quality separation. The point is that BLSH remains effective in lower-bit representations.

The input 0dB mixture
LSH with L=125. The SDR of the reconstruction is 8.95 dB
BLSH with L=25. The reconstruction SDR is 8.33 dB.

Below are a few more BLSH-based separation examples. On the left are the mixture inputs to the system; the corresponding separation results are on the right.

Noisy speech
Cleaned-up speech
Noisy speech
Cleaned-up speech
Noisy speech
Cleaned-up speech

For more details, please take a look at our ICASSP 2020 paper, which was nominated for the Best Student Paper Award [5].

We also open-sourced the code [6].

Finally, check out Sunwoo’s virtual presentation below.

This material is based upon work supported by the National Science Foundation under Award Number 1909509.

References

1. In this system, we used ideal binary masks (IBM) of the corresponding mixtures instead of the direct source estimates.
2. P. Indyk and R. Motwani, “Approximate nearest neighbor – towards removing the curse of dimensionality,” in Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), 1998, pp. 604–613.
3. M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry. ACM, 2004, pp. 253–262.
4. +1’s and -1’s.
5. Sunwoo Kim, Haici Yang, and Minje Kim, “Boosted Locality Sensitive Hashing: Discriminative Binary Codes for Source Separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 4–8, 2020.
6. https://github.com/sunwookimiub/BLSH