Paper ID | SPE-33.4 | ||
Paper Title | CAMP: A TWO-STAGE APPROACH TO MODELLING PROSODY IN CONTEXT | ||
Authors | Zack Hodari, University of Edinburgh, United Kingdom; Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman, Amazon, United Kingdom | ||
Session | SPE-33: Speech Synthesis 5: Prosody & Style | ||
Location | Gather.Town | ||
Session Time: | Thursday, 10 June, 13:00 - 13:45 | ||
Presentation Time: | Thursday, 10 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation | ||
IEEE Xplore Open Preview | Click here to view in IEEE Xplore | ||
Abstract | Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our context-aware model of prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly. |