Paper ID | SPE-15.1 |
Paper Title |
NOISE LEVEL LIMITED SUB-MODELING FOR DIFFUSION PROBABILISTIC VOCODERS |
Authors |
Takuma Okamoto, National Institute of Information and Communications Technology, Japan; Tomoki Toda, Nagoya University, Japan; Yoshinori Shiga, Hisashi Kawai, National Institute of Information and Communications Technology, Japan |
Session | SPE-15: Speech Synthesis 3: Vocoder |
Location | Gather.Town |
Session Time: | Wednesday, 09 June, 13:00 - 13:45 |
Presentation Time: | Wednesday, 09 June, 13:00 - 13:45 |
Presentation |
Poster |
Topic |
Speech Processing: [SPE-SYNT] Speech Synthesis and Generation |
Abstract |
Although the diffusion probabilistic vocoders WaveGrad and DiffWave can realize real-time, high-fidelity speech synthesis with a simple training loss function, a single model must predict the noise components over the full range of noise levels in all iterations. This paper proposes a simple but effective noise-level-limited sub-modeling framework for diffusion probabilistic vocoders, yielding Sub-WaveGrad and Sub-DiffWave. The proposed method also provides a DiffWave variant conditioned on a continuous noise level, as in WaveGrad, and spectral enhancement post-filtering. The proposed Sub-WaveGrad and Sub-DiffWave models are realized using 10 sub-models, each trained separately on a different limited range of noise levels; in inference, only the sub-models required by the noise schedule are used. Experiments using a Japanese female speech corpus indicate that both Sub-WaveGrad and Sub-DiffWave outperform vanilla WaveGrad and DiffWave in model accuracy and synthesis quality while maintaining inference speed. |
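The sub-model selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that 10 sub-models cover different limited noise levels and that inference uses only those the noise schedule needs. The uniform partition of the range, the assumption that the noise level is a scalar in [0, 1], and the names `make_bin_edges`, `sub_model_index`, and `required_sub_models` are all hypothetical, chosen for illustration.

```python
from bisect import bisect_right

NUM_SUB_MODELS = 10  # number of sub-models stated in the abstract


def make_bin_edges(num_sub_models=NUM_SUB_MODELS):
    """Return the edges of the noise-level bins.

    ASSUMPTION: the noise level lies in [0, 1] and the range is split
    uniformly; the abstract does not say how the range is partitioned.
    """
    return [i / num_sub_models for i in range(num_sub_models + 1)]


def sub_model_index(noise_level, edges):
    """Return the index of the sub-model whose bin contains noise_level."""
    if not edges[0] <= noise_level <= edges[-1]:
        raise ValueError(f"noise level {noise_level} is outside the modeled range")
    # bisect_right counts the edges that are <= noise_level.
    idx = bisect_right(edges, noise_level) - 1
    # The upper bound of the range belongs to the last bin.
    return min(idx, len(edges) - 2)


def required_sub_models(noise_schedule, edges):
    """Return the sorted indices of the sub-models an inference schedule needs."""
    return sorted({sub_model_index(level, edges) for level in noise_schedule})


if __name__ == "__main__":
    edges = make_bin_edges()
    # Hypothetical 4-iteration schedule, for illustration only.
    schedule = [0.97, 0.55, 0.12, 0.05]
    print(required_sub_models(schedule, edges))
```

Under these assumptions, a schedule with few iterations touches only a few bins, so only those sub-models need to be loaded, and each iteration still evaluates exactly one network, which is consistent with the abstract's claim that inference speed is kept.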