

Shuqi Dai
A Diffusion Pipeline for Multilingual Singing Voice Synthesis with Expressive Style Control
This is my internship project at NVIDIA Research from May 2022 to Feb 2023. The paper has not been published yet (hopefully coming out soon).
The system takes a musical score, lyrics, a style label, and a singer ID as input and generates expressive, realistic singing. It is a cascade of diffusion models: (1) performance-control models that predict timing, F0 curves, and loudness curves; (2) an acoustic model that generates mel-spectrograms conditioned on the performance-control signals; (3) a DiffWave vocoder that generates the waveform from the mel-spectrograms and F0 curves. The following figure shows the high-level architecture.

[Figure: high-level pipeline architecture. Inputs: score, lyrics, style, singer ID. Output: expressive and realistic singing.]
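To make the three-stage cascade concrete, here is a minimal Python sketch of the inference flow. All names, shapes, and signatures below are hypothetical placeholders for illustration, not the actual implementation; each stage is stubbed out where a trained diffusion model would run.

```python
# Hypothetical sketch of the cascaded inference pipeline described above.
# Class/function names and shapes are illustrative, not the real code.

from dataclasses import dataclass
import numpy as np

@dataclass
class ScoreInput:
    note_pitches: np.ndarray    # MIDI pitch per note
    note_durations: np.ndarray  # nominal note durations, in seconds
    phonemes: list[str]         # phoneme sequence derived from the lyrics
    style_id: int               # expressive style label
    singer_id: int              # target singer identity

def sample_performance_controls(score: ScoreInput, hop_seconds: float = 256 / 22050):
    """Stage 1 (stub): performance-control diffusion models would predict
    frame-level timing, F0, and loudness curves from the score."""
    n_frames = int(score.note_durations.sum() / hop_seconds)
    f0 = np.zeros(n_frames)        # placeholder for a sampled F0 curve
    loudness = np.zeros(n_frames)  # placeholder for a sampled loudness curve
    return f0, loudness

def sample_mel(score: ScoreInput, f0: np.ndarray, loudness: np.ndarray):
    """Stage 2 (stub): an acoustic diffusion model would generate a
    mel-spectrogram conditioned on the performance-control signals."""
    n_mels = 80
    return np.zeros((n_mels, len(f0)))  # placeholder mel-spectrogram

def diffwave_vocode(mel: np.ndarray, f0: np.ndarray, hop: int = 256):
    """Stage 3 (stub): a DiffWave vocoder would turn the mel-spectrogram
    (plus the F0 curve) into a waveform."""
    return np.zeros(mel.shape[1] * hop)  # placeholder waveform

def synthesize(score: ScoreInput) -> np.ndarray:
    f0, loudness = sample_performance_controls(score)
    mel = sample_mel(score, f0, loudness)
    return diffwave_vocode(mel, f0)
```

One design point worth noting: because the vocoder is conditioned on the F0 curve as well as the mel-spectrogram, pitch is controlled explicitly end to end rather than being left implicit in the spectrogram.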
Generated Example
A Happy Birthday Song in Chinese sung by different singers/styles
This song is not in the training data; it is generated from scratch given only the score.
Note that many of the singers in this demo never sang Chinese in the training data.
Multilingual & Stylistic Demo
[Audio demos: three paired generated and ground-truth clips]
Generated opera singing: