2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Paper Detail

Paper ID	SPE-49.6
Paper Title	DENOISPEECH: DENOISING TEXT TO SPEECH WITH FRAME-LEVEL NOISE MODELING
Authors	Chen Zhang, Yi Ren, Zhejiang University, China; Xu Tan, Microsoft Research Asia, China; Jinglin Liu, Kejun Zhang, Zhejiang University, China; Tao Qin, Microsoft Research Asia, China; Sheng Zhao, Microsoft Azure Speech, China; Tie-Yan Liu, Microsoft Research Asia, China
Session	SPE-49: Speech Synthesis 7: General Topics
Location	Gather.Town
Session Time	Friday, 11 June, 11:30 - 12:15
Presentation Time	Friday, 11 June, 11:30 - 12:15
Presentation	Poster
Topic	Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract	While neural text-to-speech (TTS) models can synthesize natural and intelligible speech, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech from a target speaker is available, which makes it challenging to train a TTS model for that speaker. Previous work usually addresses this challenge in one of two ways: 1) training the TTS model on speech denoised with an enhancement model; 2) taking a single noise embedding as input when training on noisy speech. However, these methods usually cannot handle speech with complicated real-world noise, such as noise that varies strongly over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker for whom only noisy speech data is available. DenoiSpeech handles real-world noisy speech by modeling fine-grained frame-level noise with a noise condition module, which is jointly trained with the TTS model. Experimental results on real-world data show that DenoiSpeech outperforms the previous two methods by 0.31 and 0.66 MOS, respectively.
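
The contrast the abstract draws between a single noise embedding and frame-level noise modeling can be illustrated with a toy sketch. This is not the DenoiSpeech architecture: the function names, the additive conditioning, the random projection `proj`, and the use of an all-zero condition to request clean output are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def frame_level_condition(hidden, noise_feats, proj):
    """Add a separate noise condition to every frame (hypothetical sketch).

    hidden:      (T, H) frame-level hidden states of a TTS decoder input
    noise_feats: (T, C) per-frame noise features
    proj:        (C, H) projection from noise features to the hidden size
    """
    return hidden + noise_feats @ proj

def utterance_level_condition(hidden, noise_feats, proj):
    """Baseline: pool the noise into one embedding shared by all frames."""
    pooled = noise_feats.mean(axis=0, keepdims=True)  # (1, C)
    return hidden + pooled @ proj                     # broadcast over T frames

rng = np.random.default_rng(0)
T, H, C = 6, 4, 2                      # frames, hidden size, noise feature size
hidden = rng.normal(size=(T, H))
proj = rng.normal(size=(C, H))

# Time-varying noise: absent in the first three frames, present in the rest.
noise_feats = np.zeros((T, C))
noise_feats[3:] = 1.0

frame_out = frame_level_condition(hidden, noise_feats, proj)
utt_out = utterance_level_condition(hidden, noise_feats, proj)

# Assumption: an all-zero condition stands in for "no noise" at synthesis time.
clean_out = frame_level_condition(hidden, np.zeros((T, C)), proj)
```

In this toy setup the frame-level variant leaves the noise-free frames untouched, while the pooled embedding spreads the same offset across every frame, which is one way to see why a single embedding struggles with noise that changes over time.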