Recently, diffusion models have become very popular; an excellent demonstration of how they work can be found in the Medium article "Diffusion models. Diffusion probabilistic models are…" by m0nads.
Diffusion models have also been applied to neural vocoders. Although they have been shown to produce high-quality speech, their inference is relatively slow compared with other non-autoregressive vocoders because they require many iterations to generate the data. PriorGrad addresses this problem by introducing a data-dependent prior, specifically, a Gaussian distribution with a diagonal covariance matrix whose entries are the frame-wise energies of the mel-spectrogram. Because noise drawn from this data-dependent prior is closer to the target waveform than noise from a standard Gaussian, PriorGrad achieves faster convergence and inference with superior performance. Koizumi et al. further improve the prior in SpecGrad by incorporating the spectral envelope derived from the mel-spectrogram, producing noise that is even closer to the target signal.
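The idea of the data-dependent prior can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name, the sum-over-bins energy, and the max-normalization are assumptions for clarity (PriorGrad additionally clips the variance to a minimum value, which is omitted here).

```python
import numpy as np

def data_dependent_prior_noise(mel, hop_length, rng=None):
    """Sample noise from N(0, diag(sigma^2)), where the per-sample variance
    follows the frame-wise energy of the mel-spectrogram (PriorGrad-style).

    mel: (n_mels, n_frames) non-negative mel-spectrogram (illustrative layout).
    hop_length: number of waveform samples per spectrogram frame.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Frame-wise energy: sum over mel bins, normalized to (0, 1] (an assumed scaling).
    energy = mel.sum(axis=0)              # (n_frames,)
    sigma2 = energy / energy.max()        # per-frame variance
    # Upsample frame-level variances to sample level.
    sigma2 = np.repeat(sigma2, hop_length)  # (n_frames * hop_length,)
    # Draw noise whose per-sample standard deviation is sqrt(sigma2):
    # louder frames get larger-amplitude noise, closer to the target waveform.
    return rng.standard_normal(sigma2.shape) * np.sqrt(sigma2)
```

At inference time, such noise replaces the standard Gaussian sample as the starting point of the reverse diffusion process, which is why fewer iterations suffice.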
However, most existing neural vocoders focus on speech signals. We found that state-of-the-art neural vocoders provide insufficient quality when applied to singing voices, possibly due to the scarcity of large-scale clean singing voice datasets and to the wider variety in pitch, loudness, and pronunciation introduced by musical expression, which makes singing voices more challenging to model.
To overcome this problem, we propose a hierarchical diffusion model that learns multiple diffusion models at different sampling rates, as illustrated in Figure 2.
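To make the hierarchy concrete, training such a model requires target waveforms at each sampling rate. The sketch below is a simplified stand-in, not the paper's pipeline: it builds progressively halved-rate targets using a crude two-sample average as anti-aliasing (a real implementation would use a proper low-pass resampler).

```python
import numpy as np

def multirate_targets(wave, n_levels=3):
    """Build training targets at progressively halved sampling rates.

    wave: 1-D waveform at the full sampling rate.
    Returns [full-rate, half-rate, quarter-rate, ...] targets, one per
    diffusion model in the hierarchy (illustrative decimation only).
    """
    targets = [wave]
    x = wave
    for _ in range(n_levels - 1):
        x = x[: len(x) // 2 * 2]          # trim to an even length
        x = 0.5 * (x[0::2] + x[1::2])     # average adjacent samples, then decimate by 2
        targets.append(x)
    return targets
```

Each diffusion model in the hierarchy is then trained on the target at its own rate, with the coarser (lower-rate) output conditioning the finer levels.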