By Mayank Kumar Singh, Senior Engineer at Sony Research India

12th April 2023

This blog is based on our paper (2210.07508.pdf), which has been accepted as a conference paper at ICASSP 2023. In this paper, we propose a hierarchical diffusion model-based vocoder that improves upon existing vocoders in the singing voice domain.

First, we will cover what neural vocoders are and where they are applied, and then move on to diffusion models and their limitations as vocoders in the singing voice domain. We will then give an overview of the proposed algorithm, followed by the results and the conclusion.

For more details about our proposed method, have a look at our paper: 2210.07508.pdf

What are neural vocoders and their applications?

Neural vocoders generate a waveform from acoustic features using neural networks and have become essential components for many speech processing tasks such as text-to-speech, voice conversion, and speech enhancement, as they often operate in acoustic feature domains for the efficient modelling of speech signals.

For example, in the papers “Robust One-shot Singing Voice Conversion” and “Nonparallel Emotional Voice Conversion for unseen speaker-emotion pairs using dual domain adversarial network & Virtual Domain Pairing”, voice conversion is applied in the Mel-spectrogram domain, and a neural vocoder then converts the spectrogram to the waveform domain, as illustrated in Figure 1.

Figure 1: Use of Neural Vocoders in voice conversion algorithms

What are diffusion models and their current limitations?

Recently, diffusion models have become very popular. For an excellent demonstration of how diffusion models work, see the article “Diffusion models” by m0nads on Medium.

Diffusion models have been adapted to neural vocoders. Although they have been shown to produce high-quality speech, their inference is relatively slow compared with other non-autoregressive vocoders because they require many iterations to generate the data. PriorGrad addresses this problem by introducing a data-dependent prior: a Gaussian distribution with a diagonal covariance matrix whose entries are the frame-wise energies of the Mel-spectrogram. Because noise drawn from the data-dependent prior is closer to the target waveform than noise drawn from a standard Gaussian, PriorGrad achieves faster convergence and inference with superior performance. SpecGrad by Koizumi et al. further improves the prior by incorporating the spectral envelope of the Mel-spectrogram, so that the noise is more similar to the target signal.
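To make the idea of a data-dependent prior concrete, here is a minimal NumPy sketch. It is not the PriorGrad implementation: the function names, the energy normalisation, the 0.1 floor, and the hop length of 256 are all assumptions chosen for illustration. It only shows how frame-wise energies of a Mel-spectrogram can be turned into a per-sample standard deviation for the noise.

```python
import numpy as np

def prior_std_from_mel(mel, hop_length, floor=0.1):
    """Per-sample standard deviation derived from frame-wise Mel energy.

    mel: non-negative array of shape (n_mels, n_frames).
    The normalisation and the floor value are illustrative assumptions.
    """
    energy = np.sqrt(np.mean(mel ** 2, axis=0))   # one value per frame
    energy = energy / (energy.max() + 1e-8)       # scale into (0, 1]
    energy = np.clip(energy, floor, 1.0)          # avoid a degenerate variance
    return np.repeat(energy, hop_length)          # frame rate -> sample rate

def sample_prior_noise(mel, hop_length, rng):
    """Draw noise from N(0, diag(std^2)) instead of N(0, I)."""
    std = prior_std_from_mel(mel, hop_length)
    return rng.standard_normal(std.shape[0]) * std

# Toy Mel-spectrogram: loud first half, quiet second half.
rng = np.random.default_rng(0)
mel = np.abs(rng.standard_normal((80, 50)))
mel[:, 25:] *= 0.05
noise = sample_prior_noise(mel, hop_length=256, rng=rng)
loud_std = noise[: 25 * 256].std()
quiet_std = noise[25 * 256:].std()
print(noise.shape, loud_std > quiet_std)
```

In this toy example the noise is larger where the Mel-spectrogram is loud and smaller where it is quiet, which is the sense in which the prior is "closer" to the target waveform.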

However, many existing neural vocoders focus on speech signals. We found that state-of-the-art neural vocoders provide insufficient quality when they are applied to a singing voice, possibly due to the scarcity of large-scale clean singing voice datasets and the wider variety in pitch, loudness, and pronunciation owing to musical expression, all of which make singing voices more challenging to model.

To overcome this problem, we propose a hierarchical diffusion model that learns multiple diffusion models at different sampling rates, as illustrated in Figure 2.

Figure 2: Overview of the proposed hierarchical PriorGrad method

A brief overview of our proposed method

The diffusion models are conditioned on acoustic features and on the data at the lower sampling rate, and they can be trained in parallel. During inference, the models progressively generate the data from the low to the high sampling rate. The diffusion model at the lowest sampling rate focuses on generating low-frequency components, which enables accurate pitch recovery, while those at higher sampling rates focus more on high-frequency details (see Figure 3). This hierarchy gives the model a strong capability for modelling singing voices.

In our experiment, we apply the proposed method to PriorGrad and show that the proposed model generates high-quality singing voices for multiple singers, outperforming the state-of-the-art PriorGrad and Parallel WaveGAN vocoders.

Figure 3: Receptive field at different sampling rates. The same architecture covers a longer time period at lower sampling rates.
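The relationship in Figure 3 is simple arithmetic: a receptive field of a fixed number of samples spans more time when each sample covers a longer interval. The short sketch below illustrates this for the two sampling rates used as an example in this blog (24 kHz and 6 kHz). The receptive field size of 3000 samples is a hypothetical value, not a figure from the paper.

```python
def receptive_field_ms(receptive_field_samples, sampling_rate_hz):
    """Time span, in milliseconds, covered by a receptive field of fixed size."""
    return 1000.0 * receptive_field_samples / sampling_rate_hz

RECEPTIVE_FIELD = 3000  # hypothetical size in samples, same network at both rates

for fs in (24000, 6000):
    span = receptive_field_ms(RECEPTIVE_FIELD, fs)
    print(f"{fs} Hz: {span} ms")
```

Under this assumption, the same architecture sees four times more context at 6 kHz than at 24 kHz, which helps it capture slowly varying structure such as pitch contours.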

Proposed Method (Training Details)

We use PriorGrad as the baseline model for our proposed method. Although PriorGrad shows promising results on speech data, we found that the quality is unsatisfactory when it is applied to a singing voice, possibly due to the wider variety in pitch, loudness, and musical expressions such as vibrato and falsetto. To tackle this problem, we propose to improve diffusion model-based neural vocoders by modelling the singing voice at multiple resolutions. An overview is illustrated in Figure 2.

Given multiple sampling rates ƒ¹ₛ > ƒ²ₛ > ··· > ƒᴺₛ, the proposed method learns a diffusion model at each sampling rate independently. The reverse process at each sampling rate ƒⁱₛ is conditioned on common acoustic features c and on the data at the next lower sampling rate, except for the model at the lowest sampling rate, which is conditioned only on c. During training, we use the ground truth data to condition the noise estimation models.

Since the noise is added linearly to the original data and the model has direct access to the ground truth lower sampling rate data, the model can more easily predict the noise for the low-frequency components. This lets the model focus more on the transformation of the high-frequency components. At the lowest sampling rate (we use 6 kHz in our experiments), the data become much simpler than at the original sampling rate, and the model can focus on generating low-frequency components, which is important for accurate pitch recovery of a singing voice. The training algorithm is illustrated in Figure 4.
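The following NumPy sketch shows how one training example for the higher sampling rate model could be assembled. It is an illustration under stated assumptions, not the training code from the paper: the helper names, the windowed-sinc filter with 101 taps, the value of the noise schedule term `alpha_bar_t`, and the use of standard Gaussian noise (rather than the PriorGrad prior) are all choices made here for brevity.

```python
import numpy as np

def lowpass_and_decimate(x, factor, taps=101):
    """Anti-aliasing windowed-sinc low-pass, then keep every `factor`-th sample."""
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / factor) / factor * np.hamming(taps)  # cutoff at the new Nyquist
    h /= h.sum()                                         # unit gain at DC
    return np.convolve(x, h, mode="same")[::factor]

def make_training_example(x_hi, factor, alpha_bar_t, rng):
    """Return the noisy input, the lower-rate condition, and the noise target."""
    x_lo = lowpass_and_decimate(x_hi, factor)   # ground truth condition
    eps = rng.standard_normal(x_hi.shape[0])    # noise to be predicted
    # The noise is added linearly to the original data.
    x_t = np.sqrt(alpha_bar_t) * x_hi + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, x_lo, eps

# One second of a 220 Hz tone at 24 kHz, conditioned on its 6 kHz version.
rng = np.random.default_rng(0)
t = np.arange(24000) / 24000
x_hi = np.sin(2 * np.pi * 220 * t)
x_t, x_lo, eps = make_training_example(x_hi, factor=4, alpha_bar_t=0.5, rng=rng)
print(x_t.shape, x_lo.shape)
```

A noise estimation network would then be trained to predict `eps` from `x_t`, the acoustic features c, `x_lo`, and the diffusion step. Because `x_lo` already carries the low-frequency content of `x_hi`, the low-frequency part of the noise is comparatively easy to infer.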

Figure 4: Training algorithm

Proposed Method (Inference Details)

During inference, we start by generating the data at the lowest sampling rate and then progressively generate the data at each higher sampling rate, using the sample generated at the lower sampling rate as the condition. The inference algorithm is illustrated in Figure 5.

In practice, we found that directly using the lower sampling rate prediction as the condition often produces noise around the Nyquist frequencies of each sampling rate, as shown in Figure 6(a). This is due to a gap between training and inference. The ground truth lower sampling rate data used during training do not contain a signal around the Nyquist frequency owing to the anti-aliasing filter, so the model can learn to directly use the signal up to the Nyquist frequency. The generated sample used during inference, however, may contain some signal in that region because of imperfect predictions, and this contaminates the prediction at the higher sampling rate. To address this problem, we propose to apply the anti-aliasing filter to the generated lower sampling rate signal before using it to condition the noise prediction model, as shown in Figure 6(b).
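The coarse-to-fine loop with the anti-aliasing step can be sketched as follows. This is a structural illustration only: the two `fake_reverse_*` functions are hypothetical stand-ins for complete reverse diffusion processes, and the filter design (windowed sinc, 101 taps, cutoff at 0.9 of the Nyquist frequency) is an assumption rather than the filter used in the paper.

```python
import numpy as np

def lowpass(x, cutoff_ratio=0.9, taps=101):
    """Windowed-sinc low-pass; cutoff_ratio = cutoff / Nyquist (assumed value)."""
    n = np.arange(taps) - (taps - 1) / 2
    h = cutoff_ratio * np.sinc(cutoff_ratio * n) * np.hamming(taps)
    h /= h.sum()  # unit gain at DC
    return np.convolve(x, h, mode="same")

def hierarchical_inference(reverse_fns, c, cond_filter=lowpass):
    """reverse_fns is ordered from the highest to the lowest sampling rate.
    Each entry maps (c, cond) to a waveform at its own sampling rate."""
    x = None
    for reverse in reversed(reverse_fns):  # lowest sampling rate first
        cond = None if x is None else cond_filter(x)  # filter the generated signal
        x = reverse(c, cond)
    return x

def nyquist_level(x):
    """Magnitude of the component that flips sign every sample."""
    alt = np.where(np.arange(x.shape[0]) % 2 == 0, 1.0, -1.0)
    return abs(float(np.mean(x * alt)))

def fake_reverse_6k(c, cond):
    """Stand-in for the 6 kHz model: a tone plus spurious energy at Nyquist."""
    n = np.arange(6000)
    return np.sin(2 * np.pi * 220 * n / 6000) + 0.05 * np.cos(np.pi * n)

def fake_reverse_24k(c, cond):
    """Stand-in for the 24 kHz model: simply holds each condition sample."""
    return np.repeat(cond, 4)

raw = fake_reverse_6k(None, None)
filtered = lowpass(raw)
out = hierarchical_inference([fake_reverse_24k, fake_reverse_6k], c=None)
print(out.shape, nyquist_level(raw) > nyquist_level(filtered))
```

In this toy setup the imperfect 6 kHz "prediction" carries energy right at its Nyquist frequency, and the filter removes most of it before the signal is passed on as the condition for the 24 kHz stage.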

Figure 5: Inference algorithm

Figure 6: Effect of the anti-aliasing filter in the case of N = 2, ƒ¹ₛ = 24000, ƒ²ₛ = 6000.


Results

We conducted objective and subjective evaluations to compare our model against the baseline Parallel WaveGAN and PriorGrad models. For the objective test, we calculate the voicing decision error (VDE), the Multi-Resolution Short-Time Fourier Transform (MR-STFT) loss, the Mel-Cepstral Distortion (MCD), the Real-Time Factor (RTF), and the Pitch Mean Absolute Error (PMAE). For the subjective test, we asked 20 human evaluators to rate the generated samples on a five-point scale (Mean Opinion Score: MOS). The results are shown in Table 1 and Table 2.
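For readers unfamiliar with these metrics, the sketch below shows common textbook definitions of three of them. The exact definitions used in the paper (for example, the pitch extractor or the unit of the pitch error) may differ; here an F0 value of 0 is assumed to mark an unvoiced frame, and the function names are hypothetical.

```python
import numpy as np

def voicing_decision_error(f0_ref, f0_gen):
    """Fraction of frames whose voiced/unvoiced decision differs."""
    return float(np.mean((f0_ref > 0) != (f0_gen > 0)))

def pitch_mae(f0_ref, f0_gen):
    """Mean absolute pitch error over frames voiced in both signals."""
    both_voiced = (f0_ref > 0) & (f0_gen > 0)
    return float(np.mean(np.abs(f0_ref[both_voiced] - f0_gen[both_voiced])))

def real_time_factor(synthesis_seconds, audio_seconds):
    """Processing time divided by the duration of the generated audio."""
    return synthesis_seconds / audio_seconds

# Toy F0 tracks in Hz (0 marks an unvoiced frame).
f0_ref = np.array([0.0, 220.0, 222.0, 225.0, 0.0])
f0_gen = np.array([0.0, 218.0, 0.0, 229.0, 110.0])

vde = voicing_decision_error(f0_ref, f0_gen)
pmae = pitch_mae(f0_ref, f0_gen)
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=2.0)
print(vde, pmae, rtf)
```

In the toy example, two of five frames disagree on voicing, and the pitch error is averaged only over the two frames that both tracks consider voiced. An RTF below 1 means synthesis runs faster than real time.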

Audio Samples are demonstrated at a website at


Conclusion

We proposed a hierarchical diffusion model for singing voice neural vocoders. The proposed method learns diffusion models at different sampling rates independently, conditioning each model on the data at the lower sampling rate. During inference, the models progressively generate the signal, applying the anti-aliasing filter to each generated lower sampling rate signal before it is used as a condition. Our experimental results show that the proposed method applied to PriorGrad outperforms PriorGrad and Parallel WaveGAN at similar computational costs. Although we focus on singing voices in this work, the proposed method is applicable to any type of audio. Evaluating the proposed method on other types of audio, such as speech, music, and environmental sounds, is left for future work.


