2021 IEEE International Conference on Acoustics, Speech and Signal Processing

Technical Program

Paper ID	SPE-43.6
Paper Title	TOWARDS DATA SELECTION ON TTS DATA FOR CHILDREN'S SPEECH RECOGNITION
Authors	Wei Wang, Zhikai Zhou, Yizhou Lu, Hongji Wang, Chenpeng Du, Yanmin Qian, Shanghai Jiao Tong University, China
Session	SPE-43: Speech Recognition 15: Robust Speech Recognition 1
Location	Gather.Town
Session Time	Thursday, 10 June, 16:30 - 17:15
Presentation Time	Thursday, 10 June, 16:30 - 17:15
Presentation	Poster
Topic	Speech Processing: [SPE-ROBU] Robust Speech Recognition
Abstract	Recent research on both utterance-level and phone-level prosody modelling has successfully improved voice quality and naturalness in text-to-speech synthesis. However, most of this work models prosody with a unimodal distribution such as a single Gaussian, which is an overly restrictive assumption. In this work, we focus on phone-level prosody modelling and introduce a Gaussian mixture model (GMM) based mixture density network. Our experiments on the LJSpeech dataset demonstrate that a GMM models phone-level prosody better than a single Gaussian. The subjective evaluations suggest that our method not only significantly improves prosody diversity in synthetic speech without the need for manual control, but also achieves better naturalness. We also find that the additional mixture density network has only a very limited influence on inference speed.