This article presents our work on non-parallel emotional voice conversion. It addresses the problem of converting the emotion of speakers for whom only neutral data is available during both training and testing, i.e., unseen speaker-emotion combinations. The article is based on a research paper accepted at ICASSP 2023, “Nonparallel Emotional Voice Conversion for Unseen Speaker-Emotion pairs using Dual Domain Adversarial Network & Virtual Domain Pairing”, co-authored by:
Collaborative Background: At Sony Research India, we have opportunities to work and collaborate with experts across the global Sony Group of Companies. As experts in developing speech technologies for Indian languages, we took the opportunity to forge a close collaboration with Dr. Naoya Takahashi, one of the leading experts in the audio/speech domain at Sony Group Corporation, Japan, to develop an emotional voice conversion system.
An emotional voice conversion (EVC) system converts the emotion of a given speech signal from one style to another, without modifying the linguistic content of the signal. EVC technology has potential applications in movie dubbing, conversational assistance, cross-lingual synthesis, etc.
Most previous EVC approaches can convert the emotion only of speakers whose emotional data is available at training or test time, i.e., for seen speaker-emotion combinations. However, collecting emotional speech from target speakers is often expensive, time-consuming, and sometimes impossible. In this paper, we address the problem of converting the emotion of speakers for whom we possess only neutral speech data, by leveraging emotional speech data from other supporting speakers.
We first modify the StarGANv2-VC architecture to convert the speaker and emotion styles simultaneously in a unified model. The modified model uses two encoders to learn speaker style and emotion style embeddings, along with dual domain source classifiers that classify the source speaker and the source emotion style. We then devise training strategies to achieve EVC for Unseen Speaker-Emotion Pairs (i.e., EVC-USEP) by using emotional data from supporting speakers. To achieve this, we propose a Virtual Domain Pairing (VDP) training strategy, which randomly generates combinations of speaker-emotion pairs that are not present in the real data, without compromising the min-max game between the discriminator and the generator in adversarial training. In particular, a fake-pair masking (FPM) strategy is proposed to ensure that the discriminator does not overfit because of the fake pairs. We refer to our proposed system as EVC-USEP throughout the paper.
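To make the idea of pairing and masking more concrete, here is a minimal sketch in plain Python. It is not the implementation from the paper: the exact masking rule is not described in this article, so the sketch assumes one plausible reading, in which target pairs are sampled over all speaker-emotion combinations and loss terms that would need real data for a non-existent (virtual) pair are masked out. All names (`sample_target_pairs`, `masked_mean`, the speaker labels, the loss values) are hypothetical.

```python
import random
from itertools import product


def sample_target_pairs(speakers, emotions, real_pairs, batch_size, rng):
    """Sample target (speaker, emotion) pairs over ALL combinations.

    Returns the sampled pairs and a 0/1 mask: 1.0 if the pair exists in
    the real data, 0.0 if it is a virtual pair with no real recordings.
    """
    all_pairs = list(product(speakers, emotions))
    targets = [rng.choice(all_pairs) for _ in range(batch_size)]
    mask = [1.0 if pair in real_pairs else 0.0 for pair in targets]
    return targets, mask


def masked_mean(per_sample_losses, mask):
    """Average per-sample loss terms over unmasked samples only.

    Assumption: terms belonging to virtual pairs are dropped here; the
    paper's actual fake-pair masking rule may differ.
    """
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_sample_losses, mask)) / kept


# Toy setup: one supporting speaker with emotional data, and two target
# speakers for whom only neutral data exists.
speakers = ["spk_support", "spk_target_1", "spk_target_2"]
emotions = ["neutral", "happy", "sad"]
real_pairs = {("spk_support", e) for e in emotions}
real_pairs |= {("spk_target_1", "neutral"), ("spk_target_2", "neutral")}

rng = random.Random(0)
targets, mask = sample_target_pairs(speakers, emotions, real_pairs, 8, rng)

# Placeholder per-sample loss values standing in for discriminator outputs.
dummy_losses = [0.5] * len(targets)
loss = masked_mean(dummy_losses, mask)

for pair, m in zip(targets, mask):
    print(pair, "real" if m else "virtual")
```

In this toy setup, pairs such as ("spk_target_1", "happy") are virtual: the generator can still be asked to produce them, while the masked average keeps them out of the loss terms that compare against real data.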
Figure 1: Block diagram of the proposed EVC-USEP architecture.
We present our results on a Hindi emotional speech database. Demo audio samples can be found online.
We conducted two subjective tests: a mean opinion score (MOS) test to evaluate the quality of the converted voices and an ABX test to evaluate emotion conversion. For objective evaluation, we use an emotion classification network to measure the accuracy of emotion conversion, along with speaker similarity scores. Both objective and subjective evaluations confirm that the proposed method successfully converts the emotion of the target speakers, outperforming the baselines with respect to emotion similarity, speaker similarity, and quality of the converted voices, while achieving decent naturalness.
Table 1: Subjective and objective evaluation results. MOS values for quality are shown with the margin of error corresponding to a 95% confidence interval.
Figure 2: ABX Subjective Evaluation for Emotion Similarity.
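For readers unfamiliar with how a MOS is reported with a margin of error for a 95% confidence interval, the following is a minimal sketch of one common way to compute it. It assumes the normal approximation (z = 1.96) applied to the sample standard deviation; the article does not state which formula was used, and the ratings below are made up for illustration.

```python
import math
import statistics


def mos_with_margin(ratings, z=1.96):
    """Return (mean opinion score, margin of error).

    The margin is z * s / sqrt(n), where s is the sample standard
    deviation and z = 1.96 corresponds to a 95% interval under the
    normal approximation (an assumption, not the paper's stated method).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    sd = statistics.stdev(ratings)
    return mean, z * sd / math.sqrt(n)


# Hypothetical listener ratings on a 1-5 scale.
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, margin = mos_with_margin(ratings)
print(f"MOS = {mos:.2f} +/- {margin:.2f}")
```

With few listeners, a Student's t critical value in place of 1.96 gives a slightly wider and more conservative interval.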
To learn more, click on the link below:
Demo Samples: https://demosamplesites.github.io/EVCUP/