Have you ever wished you were a good singer? Some people believe it's a natural gift that practice can never make up for (like my wife, a natural-born singer who looks down on me in that regard). I disagree with her, but I admit that I'm not a good singer and have failed to improve my singing my entire life so far. So instead, I decided to get some help from AI.
SAIGE recently developed a deep learning system called “Deep Autotuner,” which takes an out-of-tune singing voice as input and produces an estimated in-tune version. How does it know how far a sung melody is out of tune? Well, since we humans can catch it even in songs we have never heard before, I suspect it comes from comparing the main melody against the accompaniment. If the singing voice (or any other instrument) departs from the harmony, our brains recognize the mismatch as dissonance. Accordingly, our deep learning system is trained to map an out-of-tune singing voice signal and its accompaniment signal to the amount of pitch shift needed to recover the in-tune version.
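To make the “amount of pitch shift” idea concrete, here is a minimal sketch (not the paper's system) that measures how sharp a sung note is relative to a reference, in cents, using a crude FFT-peak pitch estimate. The 440 Hz target and the slightly sharp “take” are made-up illustration values:

```python
import numpy as np

def dominant_freq(x, sr):
    # Peak of the magnitude spectrum as a crude pitch estimate
    spec = np.abs(np.fft.rfft(x))
    return np.argmax(spec) * sr / len(x)

def cents_off(f_sung, f_target):
    # 100 cents = one semitone; positive means the singer is sharp
    return 1200.0 * np.log2(f_sung / f_target)

sr = 22050
t = np.arange(sr) / sr
target = np.sin(2 * np.pi * 440.0 * t)   # in-tune reference (A4)
sung = np.sin(2 * np.pi * 452.9 * t)     # a slightly sharp take

off = cents_off(dominant_freq(sung, sr), dominant_freq(target, sr))
# `off` comes out around +50 cents, i.e. half a semitone sharp
```

A real system has to do this implicitly for a full polyphonic accompaniment rather than a single clean reference tone, which is exactly where the learned model earns its keep.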
We were very excited to work with Smule’s Intonation dataset, which includes 4,702 quality singing voice tracks by 3,556 unique singers, accompanied by 474 unique arrangements. Using these as the target, we created an artificially corrupted version of each track as the simulated input to the system, detuning the original singing voice by up to one semitone. We adopted a CNN+GRU network architecture operating on the constant-Q transform (CQT). Below are the pitch contours over frames.
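The corruption step can be sketched as follows, assuming a simple resampling-based detune (the actual pipeline would use a proper pitch shifter that preserves duration); the sampling rate and the test tone are illustrative:

```python
import numpy as np

def semitones_to_ratio(s):
    # One semitone corresponds to a frequency ratio of 2**(1/12)
    return 2.0 ** (s / 12.0)

def detune(signal, semitones):
    """Crude pitch shift by resampling.

    Note: this also changes the duration by the same ratio; a real
    pipeline would use a phase vocoder to keep the timing intact.
    """
    ratio = semitones_to_ratio(semitones)
    n_out = int(round(len(signal) / ratio))
    t_out = np.arange(n_out) * ratio  # read the input at a faster/slower rate
    return np.interp(t_out, np.arange(len(signal)), signal)

rng = np.random.default_rng(0)
sr = 22050
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440.0 * t)   # stand-in for an in-tune vocal
shift = rng.uniform(-1.0, 1.0)          # detune by up to one semitone
corrupt = detune(clean, shift)
```

Pairing each `corrupt` input with its known `shift` gives the network a supervised target for the pitch correction it should apply.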
Please check out the audio demo below. It’s still a bit robotic and noisy, mainly due to the phase synthesis part, but you can hear that the deep autotuner is up and running!
Sound examples (the detuned singing plays first, followed by the autotuned version):
For more details, please see our recent paper2) and source code3).
References
1. Sanna Wager, George Tzanetakis, Stefan Sullivan, Cheng-i Wang, John Shimmin,
2. Sanna Wager, George Tzanetakis, Cheng-i Wang, Lijiang Guo, Aswin Sivaraman, and Minje Kim, “Deep autotuner: A data-driven approach to natural-sounding pitch correction for singing voice in karaoke performances,” arXiv:1902.00956 [pdf]
3. Will be available soon.