

Shuqi Dai
A Diffusion Pipeline for Multilingual Singing Voice Synthesis with Expressive Style Control
This is my internship project at NVIDIA Research from May 2022 to Feb 2023. The paper has not been published yet (hopefully coming out soon).
The system takes a musical score, lyrics, a style label, and a singer ID as input and generates expressive, realistic singing. It is a cascade of diffusion models: (1) performance-control models that predict timing, F0 curves, and loudness curves; (2) an acoustic model that generates mel-spectrograms conditioned on the performance-control signals; (3) a DiffWave vocoder that generates the waveform from the mel-spectrograms and F0 curves. The following figure shows the high-level architecture.

[Figure: high-level pipeline architecture. Inputs: score, lyrics, style, singer ID. Output: expressive and realistic singing.]
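To make the three-stage cascade concrete, here is a minimal Python sketch of the inference flow. All names, shapes, and signatures below are hypothetical placeholders for illustration, not the actual implementation; each stage is stubbed out where a trained diffusion model would run.

```python
# Hypothetical sketch of the cascaded inference pipeline described above.
# Class/function names and shapes are illustrative, not the real code.

from dataclasses import dataclass
import numpy as np

@dataclass
class ScoreInput:
    note_pitches: np.ndarray    # MIDI pitch per note
    note_durations: np.ndarray  # nominal note durations, in seconds
    phonemes: list[str]         # phoneme sequence derived from the lyrics
    style_id: int               # expressive style label
    singer_id: int              # target singer identity

def sample_performance_controls(score: ScoreInput, hop_seconds: float = 256 / 22050):
    """Stage 1 (stub): performance-control diffusion models would predict
    frame-level timing, F0, and loudness curves from the score."""
    n_frames = int(score.note_durations.sum() / hop_seconds)
    f0 = np.zeros(n_frames)        # placeholder for a sampled F0 curve
    loudness = np.zeros(n_frames)  # placeholder for a sampled loudness curve
    return f0, loudness

def sample_mel(score: ScoreInput, f0: np.ndarray, loudness: np.ndarray):
    """Stage 2 (stub): an acoustic diffusion model would generate a
    mel-spectrogram conditioned on the performance-control signals."""
    n_mels = 80
    return np.zeros((n_mels, len(f0)))  # placeholder mel-spectrogram

def diffwave_vocode(mel: np.ndarray, f0: np.ndarray, hop: int = 256):
    """Stage 3 (stub): a DiffWave vocoder would turn the mel-spectrogram
    (plus the F0 curve) into a waveform."""
    return np.zeros(mel.shape[1] * hop)  # placeholder waveform

def synthesize(score: ScoreInput) -> np.ndarray:
    f0, loudness = sample_performance_controls(score)
    mel = sample_mel(score, f0, loudness)
    return diffwave_vocode(mel, f0)
```

One design point worth noting: because the vocoder is conditioned on the F0 curve as well as the mel-spectrogram, pitch is controlled explicitly end to end rather than being left implicit in the spectrogram.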
Generated Example
A Happy Birthday Song in Chinese sung by different singers/styles
This song is not in the training data; it is generated from scratch given only the score.
Note that many of the singers in this demo never sang Chinese in the training data.
Multilingual & Stylistic Demo
[Audio demos: three paired generated and ground-truth clips]
Generated opera singing: