Paper ID | SPE-43.6 |
Paper Title |
TOWARDS DATA SELECTION ON TTS DATA FOR CHILDREN'S SPEECH RECOGNITION |
Authors |
Wei Wang, Zhikai Zhou, Yizhou Lu, Hongji Wang, Chenpeng Du, Yanmin Qian, Shanghai Jiao Tong University, China |
Session | SPE-43: Speech Recognition 15: Robust Speech Recognition 1 |
Location | Gather.Town |
Session Time: | Thursday, 10 June, 16:30 - 17:15 |
Presentation Time: | Thursday, 10 June, 16:30 - 17:15 |
Presentation |
Poster |
Topic |
Speech Processing: [SPE-ROBU] Robust Speech Recognition |
Abstract |
Recent research on both utterance-level and phone-level prosody modelling has successfully improved voice quality and naturalness in text-to-speech synthesis. However, most of these approaches model prosody with a unimodal distribution such as a single Gaussian, which is not a reasonable enough assumption. In this work, we focus on phone-level prosody modelling, where we introduce a Gaussian mixture model (GMM) based mixture density network. Our experiments on the LJSpeech dataset demonstrate that a GMM can model phone-level prosody better than a single Gaussian. The subjective evaluations suggest that our method not only significantly improves the prosody diversity of synthetic speech without the need for manual control, but also achieves better naturalness. We also find that the additional mixture density network has only a very limited influence on inference speed. |